Build an Academic Paper Finder with n8n + Semantic Scholar + ArXiv

Literature reviews are painful. You search Google Scholar for one topic, open 20 tabs, check citation counts manually, copy references into your doc one by one, format them in APA, then realize you need BibTeX too. A search that should take 10 minutes turns into 2 hours before you’ve read a single paper.

I built a workflow that searches three academic databases simultaneously, ranks results by citation count, and generates both APA and BibTeX citations automatically, all triggered from a Google Sheet.

What it does

Add search query to Google Sheet → set Status to “Pending” → workflow runs → searches Semantic Scholar + PubMed + ArXiv → ranks by citation count → writes top 10 papers + full citations back to the sheet → status updates to “Completed”

Takes about 15-20 seconds per query.
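
The ranking step can be sketched roughly like this. It is a minimal JavaScript sketch, not the workflow's actual code: the function name rankByCitations and the fields title, year, and citationCount are assumptions about what the search results look like.

```javascript
// Hypothetical result shape: { title, year, citationCount }.
// The real field names returned by the search node may differ.
function rankByCitations(papers, topN = 10) {
  return [...papers] // copy so the input array is not mutated
    .sort((a, b) => (b.citationCount ?? 0) - (a.citationCount ?? 0))
    .slice(0, topN);
}

const sample = [
  { title: "Paper A", year: 2021, citationCount: 12 },
  { title: "Paper B", year: 2023, citationCount: 340 },
  { title: "Paper C", year: 2019 }, // a missing count is treated as 0
];
const ranked = rankByCitations(sample, 2);
console.log(ranked.map((p) => p.title)); // Paper B first, then Paper A
```

Treating a missing count as 0 keeps uncounted papers at the bottom instead of breaking the sort.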

Databases searched

  • Semantic Scholar — broad coverage across all disciplines

  • PubMed — medical and life sciences

  • ArXiv — physics, mathematics, computer science, AI

How to use it

  1. Open your Google Sheet

  2. Type your search query in column A (e.g., “transformer attention mechanism”, “CRISPR gene editing off-target effects”)

  3. Set Status to “Pending”

  4. Wait about 20 seconds

  5. Results appear in the same row — top papers list, citation counts, APA citations, BibTeX citations

  6. To re-run a query (for example, after changing the workflow settings), change Status back to “Pending”

You can queue multiple queries at once. The workflow processes all pending rows each time it runs.
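
Picking out the queued rows could look something like the sketch below. It assumes each sheet row arrives as an object keyed by the column headers ("Search Query", "Status"). The helper name pendingRows, the case-insensitive match, and skipping empty queries are illustrative choices, not confirmed behavior of the workflow.

```javascript
// Hypothetical helper: keep only rows that are queued and have a query.
function pendingRows(rows) {
  return rows.filter((row) => {
    const status = String(row["Status"] ?? "").trim().toLowerCase();
    const query = String(row["Search Query"] ?? "").trim();
    return status === "pending" && query !== "";
  });
}

const rows = [
  { "Search Query": "transformer attention mechanism", Status: "Pending" },
  { "Search Query": "CRISPR gene editing off-target effects", Status: "Completed" },
  { "Search Query": "", Status: "Pending" }, // no query, so it is skipped
];
const queue = pendingRows(rows);
console.log(queue.map((row) => row["Search Query"]));
```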

What lands in your Sheet

Each row gets updated with:

  • Results Found — total papers found across all three databases

  • Top Papers — ranked list of top 10 most-cited papers with year and citation count

  • Total Citations — combined citation count across top 10

  • Most Cited — the single highest-cited paper with count

  • APA Citations — ready-to-paste formatted references

  • BibTeX Citations — ready-to-paste .bib entries

  • Search Date
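
As a rough illustration, the columns above could be assembled like this. The function name summarize, the paper fields (title, year, citationCount), and the exact text formats are assumptions. The three sample papers reuse the titles and counts from the example output in this post, and the fourth is made up.

```javascript
// Hypothetical sketch: build the sheet columns from the search results.
// allPapers = everything returned; topPapers = the ranked top slice.
function summarize(allPapers, topPapers) {
  const count = (p) => p.citationCount ?? 0;
  const fmt = (n) => n.toLocaleString("en-US"); // 1847 -> "1,847"
  const lines = topPapers.map(
    (p, i) => `${i + 1}. ${p.title} (${p.year}) - ${fmt(count(p))} citations`
  );
  const mostCited = topPapers.reduce(
    (best, p) => (count(p) > count(best) ? p : best),
    topPapers[0]
  );
  return {
    "Results Found": allPapers.length,
    "Top Papers": lines.join("\n"),
    "Total Citations": topPapers.reduce((sum, p) => sum + count(p), 0),
    "Most Cited": mostCited ? `${mostCited.title} (${fmt(count(mostCited))} citations)` : "",
    "Search Date": new Date().toISOString().slice(0, 10), // YYYY-MM-DD, UTC
  };
}

const top = [
  { title: "Survey of Hallucination in Natural Language Generation", year: 2023, citationCount: 1847 },
  { title: "TruthfulQA: Measuring How Models Mimic Human Falsehoods", year: 2022, citationCount: 1203 },
  { title: "Language Models (Mostly) Know What They Know", year: 2022, citationCount: 891 },
];
const all = [...top, { title: "Made-up uncited preprint", year: 2024 }];
const summary = summarize(all, top);
console.log(summary["Total Citations"]); // 3941
```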

Example output for query “large language model hallucination”:


Top Papers:

1. Survey of Hallucination in Natural Language Generation (2023) - 1,847 citations

2. TruthfulQA: Measuring How Models Mimic Human Falsehoods (2022) - 1,203 citations

3. Language Models (Mostly) Know What They Know (2022) - 891 citations

...

APA Citations:

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys. https://doi.org/10.1145/3571730

BibTeX:

@article{ji2023survey,

title={Survey of hallucination in natural language generation},

author={Ji, Z. and Lee, N. and Frieske, R. and ...},

year={2023},

journal={ACM Computing Surveys},

doi={10.1145/3571730}

}
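
The APA line in the example above follows a simple pattern: authors, year, title, venue, DOI link. A minimal sketch of that pattern is below. It assumes authors arrive as strings already in "Family, I." form, and the function names, field names, and placeholder paper are hypothetical. Titles are passed through unchanged, since automatic sentence casing is unreliable.

```javascript
// Hypothetical sketch of the author list: list everyone up to seven authors,
// otherwise the first six, an ellipsis, and the last author.
function formatAuthorsApa(authors) {
  if (authors.length === 0) return "";
  if (authors.length === 1) return authors[0];
  const last = authors[authors.length - 1];
  if (authors.length > 7) {
    return `${authors.slice(0, 6).join(", ")}, ... & ${last}`;
  }
  return `${authors.slice(0, -1).join(", ")}, & ${last}`;
}

function formatApa(paper) {
  const authors = formatAuthorsApa(paper.authors ?? []);
  const parts = [`${authors} (${paper.year ?? "n.d."}).`, `${paper.title}.`];
  if (paper.venue) parts.push(`${paper.venue}.`);
  if (paper.doi) parts.push(`https://doi.org/${paper.doi}`);
  return parts.join(" ").trim();
}

// Placeholder data, not a real paper.
const apa = formatApa({
  authors: ["Adams, A.", "Baker, B."],
  year: 2021,
  title: "An example title",
  venue: "Journal of Examples",
  doi: "10.0000/example",
});
console.log(apa);
```

The seven-author cutoff mirrors the example output above. APA 7th edition lists up to 20 authors before abbreviating, so adjust the threshold if your style guide requires it.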

Setup

You’ll need:

  • Google Sheets (free)

  • n8n instance (self-hosted — uses PDF Vector community node)

  • PDF Vector account (free tier: 100 credits/month)

No Gmail or Slack needed — this one runs entirely from a spreadsheet.

About 10 minutes to configure.

Download

Workflow JSON:

academic-paper-finder.json

Full workflow collection:

khanhduyvt0101/workflows


Setup Guide

Step 1: Get your PDF Vector API key

Sign up at pdfvector.com — free plan works for testing. Go to API Keys and generate a key.

Step 2: Create your Google Sheet

Headers in Row 1:


Search Query | Status | Results Found | Top Papers | Total Citations | Most Cited | APA Citations | BibTeX Citations | Search Date

Step 3: Import the workflow

Download the JSON from GitHub and import into n8n via Import from File.

Step 4: Configure the nodes

PDF Vector - Search Papers:

  • Add new credential (Bearer Token)

  • Paste your API key

  • Query, providers, and limit are pre-configured

Read Queries (Google Sheets):

  • Connect Google Sheets account (OAuth2)

  • Paste your Sheet ID

Update Results (Google Sheets):

  • Same credential

  • Same Sheet ID

  • Matches on “Search Query” column to update the correct row

Step 5: Test it

Add a query to your sheet, set Status to “Pending,” and wait about 20 seconds. The row will update with results.


Accuracy

Results depend on the databases, not on PDF extraction — this workflow searches academic APIs rather than parsing uploaded documents.

  • Semantic Scholar: strong coverage for CS, AI, economics, biology, most STEM fields

  • PubMed: highly reliable for medical and life sciences research

  • ArXiv: excellent for preprints in physics, math, CS, and AI; citation counts tend to be lower, since preprints are often recent and citations may be credited to the later published version

Citation counts reflect what’s indexed in each database, so a paper with 500 Semantic Scholar citations may have more in Google Scholar.

BibTeX keys are auto-generated as [firstauthor][year][firstword] — you may want to standardize them for larger bibliographies.
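
A key in that [firstauthor][year][firstword] shape could be generated along these lines. This is a sketch under assumptions, not the workflow's actual code: the function names, the paper fields, the "Family, I." author strings, and the use of @article for every entry are all illustrative.

```javascript
// Hypothetical sketch: key = first author's family name + year + first title word.
function bibtexKey(paper) {
  const firstAuthor = (paper.authors?.[0] ?? "unknown").split(",")[0];
  const firstWord = (paper.title ?? "").trim().split(/\s+/)[0];
  const clean = (s) => s.toLowerCase().replace(/[^a-z0-9]/g, ""); // ASCII only
  return `${clean(firstAuthor)}${paper.year ?? ""}${clean(firstWord)}`;
}

function toBibtex(paper) {
  const fields = {
    title: paper.title,
    author: (paper.authors ?? []).join(" and "), // BibTeX separates authors with "and"
    year: paper.year,
    journal: paper.venue,
    doi: paper.doi,
  };
  const body = Object.entries(fields)
    .filter(([, value]) => value !== undefined && value !== "")
    .map(([name, value]) => `  ${name}={${value}}`)
    .join(",\n");
  return `@article{${bibtexKey(paper)},\n${body}\n}`;
}

const entry = toBibtex({
  authors: ["Ji, Z.", "Lee, N."], // truncated author list, for illustration only
  year: 2023,
  title: "Survey of hallucination in natural language generation",
  venue: "ACM Computing Surveys",
  doi: "10.1145/3571730",
});
console.log(entry); // key is ji2023survey
```

Because the cleanup step drops non-ASCII characters, names with diacritics produce shortened keys. Two papers by the same first author in the same year with the same first title word would also collide, which is one reason to standardize keys by hand.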

Customizing it

Adjust number of results:

In the PDF Vector node, limit is set to 20. Increase up to 50 for broader searches, decrease to 5 for quick lookups.

Search only one database:

In the PDF Vector node, remove providers from the array. Use ["semantic-scholar"] alone for general research or ["pubmed"] for clinical topics.

Add a date filter:

In the Process & Format Results node, filter topPapers by paper.year >= 2020 before formatting to return only recent work.
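
A sketch of that filter, assuming each paper object carries a numeric (or numeric string) year field. The helper name filterByYear and the sample papers are hypothetical, and the surrounding code in the Process & Format Results node is not shown here.

```javascript
// Hypothetical helper: keep papers published in minYear or later.
// Papers with a missing or unparseable year are dropped.
function filterByYear(papers, minYear) {
  return papers.filter((paper) => Number(paper.year) >= minYear);
}

const topPapers = [
  { title: "Older paper", year: 2018 },
  { title: "Recent paper", year: 2022 },
  { title: "Undated paper" },
];
const recentPapers = filterByYear(topPapers, 2020);
console.log(recentPapers.map((paper) => paper.title)); // only "Recent paper" remains
```

Filtering before the top 10 cut rather than after keeps the list full when many highly cited results are older.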

Connect to Notion:

After the Sheets update node, add a Notion node to create a page per query with the full citation list — useful if your research workflow lives in Notion.

Export to .bib file:

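Aggregate BibTeX entries across multiple rows and use the Write Binary File node (Read/Write Files from Disk in newer n8n versions) to write a .bib file to the n8n host’s disk. To save it to Google Drive instead, upload the file with a Google Drive node.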
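
The aggregation itself is just string joining. The sketch below only builds the file contents from the "BibTeX Citations" column and leaves the actual file write to whichever n8n node you use. The function name buildBibFile and the sample rows are hypothetical.

```javascript
// Hypothetical helper: merge the BibTeX cells of several sheet rows
// into one .bib file body, skipping rows that have no entries yet.
function buildBibFile(rows) {
  const entries = rows
    .map((row) => String(row["BibTeX Citations"] ?? "").trim())
    .filter((cell) => cell !== "");
  return entries.join("\n\n") + "\n"; // blank line between entries
}

const sheetRows = [
  { "BibTeX Citations": "@article{a2020x,\n  year={2020}\n}" },
  { "BibTeX Citations": "" }, // still pending, nothing to export
  { "BibTeX Citations": "@article{b2021y,\n  year={2021}\n}\n" },
];
const bib = buildBibFile(sheetRows);
console.log(bib);
```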


Limitations

  • Requires self-hosted n8n (PDF Vector is a community node)

  • Search quality depends on database coverage — niche fields may return fewer results

  • APA formatting is standard but may need minor adjustments for edge cases (edited volumes, book chapters, conference papers)

  • ArXiv preprint citation counts are typically lower than those of the final published versions

  • No deduplication across databases — the same paper may appear from multiple sources
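
If duplicates become a problem, one possible approach is to deduplicate before ranking. The sketch below is not part of the workflow. It assumes paper objects with optional doi, title, and citationCount fields, matches on DOI when present and on a normalized title otherwise, and keeps the copy with the higher citation count.

```javascript
// Hypothetical helper: collapse papers that share a DOI or a normalized title.
function dedupePapers(papers) {
  const seen = new Map();
  for (const paper of papers) {
    const normalizedTitle = (paper.title ?? "")
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, " ")
      .trim();
    const key = paper.doi ? `doi:${paper.doi.toLowerCase()}` : `title:${normalizedTitle}`;
    const existing = seen.get(key);
    // Keep whichever copy reports more citations.
    if (!existing || (paper.citationCount ?? 0) > (existing.citationCount ?? 0)) {
      seen.set(key, paper);
    }
  }
  return [...seen.values()];
}

// Placeholder data, not real papers.
const merged = dedupePapers([
  { title: "Example Paper", doi: "10.0000/ABC", citationCount: 10 },
  { title: "Example paper", doi: "10.0000/abc", citationCount: 25 },
  { title: "Another  Paper!", citationCount: 3 },
  { title: "another paper", citationCount: 1 },
]);
console.log(merged.length); // 2
```

This will not merge a preprint and its published version when they carry different DOIs, and a copy with a DOI is never matched against a copy without one, so it reduces duplicates rather than eliminating them.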


PDF Vector n8n integration

Full workflow collection

Questions? Drop a comment.

Phenomenal execution on the BibTeX + APA automation — that’s something academic teams desperately need. The note about deduplication is real though — usually where these solutions break at scale.

Did you consider adding a Claude integration for automatic literature summaries? You’ve already got structured PDFs; piping them through an LLM for quick summaries before researchers dive deep could be a game-changer.