I built a literature review assistant in n8n — drops a paper in Drive, auto-extracts metadata, finds related papers on Semantic Scholar + PubMed, and creates a Notion summary

Literature reviews are the part of research nobody enjoys. You download a paper, manually pull out the authors, year, journal, DOI, copy-paste the abstract somewhere, then go hunting for related papers one by one. Multiply that by 50 papers and it’s a week of tedious work.

Built a workflow that handles all of it automatically. Drop a PDF in a Google Drive folder and it does the rest.

What it does

New paper in Drive → extracts all metadata → searches Semantic Scholar + PubMed for related papers → generates APA citation → logs to Google Sheets literature database → creates a Notion summary page

The whole thing takes about 20-30 seconds per paper.

What gets extracted

Paper metadata:

  • Title (exact), all authors (full names), journal name, publication year, DOI

  • Full abstract, keywords, research field

  • Study type: Experimental / Observational / Review / Meta-analysis / Case Study / Qualitative / Mixed Methods / Theoretical

  • Sample size, methodology summary

  • Main findings (numbered list)

  • Conclusions, limitations, future research suggestions

Related papers search:

Uses the paper’s top 3 keywords (or first 5 words of the title if no keywords) to query both Semantic Scholar and PubMed simultaneously. Returns up to 10 results, each with title, year, and citation count. The top 5 get logged to your Sheet.
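As a rough sketch, the query-building step looks something like this (a simplified version of what the Code node does; the field names `keywords` and `title` are assumptions about the extraction output):

```javascript
// Build the related-papers search query from extracted metadata.
// Assumed input shape: { keywords: string[], title: string }
function buildSearchQuery(paper) {
  if (paper.keywords && paper.keywords.length > 0) {
    // Use the top 3 keywords when the paper provides them
    return paper.keywords.slice(0, 3).join(" ");
  }
  // Fall back to the first 5 words of the title
  return paper.title.split(/\s+/).slice(0, 5).join(" ");
}
```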

APA citation:

Auto-generated from extracted metadata in this format:


Author, A., & Author, B. (2024). Title of the paper. Journal Name. https://doi.org/10.xxxx
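If you want to tweak the citation format, the assembly logic can be sketched roughly like this (a simplified sketch, not the exact node code; field names are assumptions, and edge cases like missing years or corporate authors aren't handled):

```javascript
// Assemble an APA-style citation from extracted metadata.
// Assumed input: { authors: string[], year, title, journal, doi }
function buildApaCitation(meta) {
  // "Jane Doe" -> "Doe, J."
  const formatted = meta.authors.map((name) => {
    const parts = name.trim().split(/\s+/);
    const last = parts.pop();
    const initials = parts.map((p) => p[0].toUpperCase() + ".").join(" ");
    return initials ? `${last}, ${initials}` : last;
  });
  const authorStr =
    formatted.length > 1
      ? formatted.slice(0, -1).join(", ") + ", & " + formatted[formatted.length - 1]
      : formatted[0];
  const doiPart = meta.doi ? ` https://doi.org/${meta.doi}` : "";
  return `${authorStr} (${meta.year}). ${meta.title}. ${meta.journal}.${doiPart}`;
}
```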

What lands in Google Sheets

Each row gets: Title, Authors, Year, Journal, DOI, Research Field, Study Type, Sample Size, Keywords, Abstract, Conclusions, APA Citation, Related Papers Found (count), File Link, Added Date

Your entire literature database in one Sheet. Filterable by year, study type, research field.

What lands in Notion

Creates a new page under your chosen parent page with:

  • Full citation header

  • Authors, year, journal

  • Truncated abstract (800 chars) + conclusions

Good for annotation — you can add your own notes directly in Notion after it’s created.
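The 800-character abstract cutoff is just a slice; if you want a different limit, the logic is roughly this (a sketch, not the exact node code):

```javascript
// Truncate the abstract for the Notion page body.
// Default limit matches the workflow's 800-character cutoff.
function truncateAbstract(text, max = 800) {
  if (text.length <= max) return text;
  // Cut at the limit and append an ellipsis marker
  return text.slice(0, max).trimEnd() + "…";
}
```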

Setup

You’ll need:

  • Google Drive and Sheets (free)

  • Notion account (free)

  • n8n instance (self-hosted — uses PDF Vector community node)

  • PDF Vector account (free tier: 100 credits/month, roughly 20-25 papers)

About 20 minutes to configure.

Download

Workflow JSON (and the full workflow collection):

github.com/khanhduyvt0101/workflows


Setup Guide

Step 1: Get your PDF Vector API key

Sign up at https://www.pdfvector.com — free plan works fine. Go to API Keys and generate a key.

Step 2: Create your Google Drive folder

Create a folder called “Research Papers” in Google Drive. Copy the folder ID from the URL (string after /folders/).

Step 3: Set up your Google Sheet

Create a new spreadsheet with these exact headers in Row 1:


Title | Authors | Year | Journal | DOI | Research Field | Study Type | Sample Size | Keywords | Abstract | Conclusions | Citation | Related Papers Found | File Link | Added Date

Copy the Sheet ID from the URL (long string between /d/ and /edit).

Step 4: Set up Notion

In Notion, create a page called “Literature Review” (or whatever you want). Copy the page ID — it’s the last part of the page URL, the 32-character string after the last -.
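If you'd rather not eyeball the URL, here's a quick sketch for pulling that ID out programmatically (assuming the standard Notion URL shape with the 32-character hex ID at the end):

```javascript
// Extract the 32-character Notion page ID from a page URL like
// https://www.notion.so/Literature-Review-<32 hex chars>
function extractNotionPageId(url) {
  const last = url.split("/").pop().split("?")[0];
  const match = last.match(/[0-9a-f]{32}$/i);
  return match ? match[0] : null;
}
```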

Connect your Notion account in n8n via the Notion credential node.

Step 5: Import the workflow

Download the JSON from GitHub and import into n8n via Import from File.

Step 6: Configure the nodes

Google Drive Trigger:

  • Connect your Google account

  • Paste your Research Papers folder ID

Download Paper:

  • Same Google credential

PDF Vector - Extract Paper Info:

  • Add new credential (Bearer Token type)

  • Paste your API key

PDF Vector - Find Related Papers:

  • Same PDF Vector credential

  • Uses academic search — queries Semantic Scholar and PubMed automatically

Add to Literature Database:

  • Connect Google Sheets

  • Paste your Sheet ID

  • Sheet tab name should match (default “Sheet1”)

Create Notion Summary:

  • Connect Notion account

  • Paste your parent page ID

Step 7: Test it

Activate the workflow and drop any research paper PDF into your Drive folder. After about 30 seconds, check your Sheet; you should see a fully populated row. Then check Notion for the new summary page.


Accuracy

Tested across papers from medicine, psychology, computer science, and economics.

  • Metadata extraction (title, authors, year, journal): ~98% on digital PDFs

  • Abstract and conclusions: ~95%

  • Study type classification: ~90% — occasionally misclassifies reviews as meta-analyses

  • Keyword extraction: ~92%

  • Related papers search: depends on how niche the topic is — well-indexed fields (medicine, CS) return 10 results consistently; very niche topics may return 3-5

Scanned papers drop to about 85% on metadata accuracy.

Cost

Each paper uses about 4-5 PDF Vector credits (extraction + academic search). Free tier of 100 credits gets you roughly 20-25 papers per month.

Basic plan is $25/month for 3,000 credits if you’re doing a large review.

Customizing it

Change how many related papers are fetched:

In the PDF Vector - Find Related Papers node, change limit: 10 to whatever you want. The Code node currently takes the top 5 for the Notion page.
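If you bump the search limit, you may want to bump the Notion slice too. The top-5 selection is roughly this (a sketch; sorting by citation count is an assumption about how you'd want ties broken):

```javascript
// Pick the top N related papers for the Notion summary.
// Assumed item shape: { title, year, citationCount }
function topRelatedPapers(papers, n = 5) {
  return [...papers]
    .sort((a, b) => (b.citationCount || 0) - (a.citationCount || 0))
    .slice(0, n);
}
```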

Search only one database:

In the academic search node, change providers from ["semantic-scholar", "pubmed"] to just one. Useful if your field is primarily indexed in one database.

Add email digest:

Drop a Gmail node at the end to send yourself a daily digest of papers added. Use a scheduled trigger instead of Drive trigger and loop through new Sheet rows.
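A Code node between the Sheets read and the Gmail node could filter to today's additions, roughly like this (a sketch; the "Added Date" column name and ISO date format are assumptions about your Sheet):

```javascript
// Keep only rows whose "Added Date" falls on today's date,
// assuming dates are stored as ISO strings (YYYY-MM-DD...).
function rowsAddedToday(rows, now = new Date()) {
  const today = now.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return rows.filter((r) => (r["Added Date"] || "").startsWith(today));
}
```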

Skip Notion:

Delete the last node if you don’t use Notion. The workflow works fine without it — everything important is already in Google Sheets.

Add more study types:

Edit the studyType enum in the PDF Vector extraction node schema to add types relevant to your field.


Limitations

  • Requires self-hosted n8n (PDF Vector is a community node)

  • Related paper search quality depends on the paper’s keyword quality — poorly keyworded papers may return irrelevant results

  • Notion page content is text-only (no tables or formatted sections in this version)

  • APA citation is auto-generated and should be verified before use in formal writing

  • Doesn’t handle multi-paper batch uploads — each file triggers individually



Questions? Drop a comment if something’s not working or you want to adjust it for your research workflow.

The manual literature review process is genuinely one of the most expensive time sinks in research workflows — spending an hour per paper on metadata extraction that could be automated is painful when you’re working through 50+ papers.

The parallel Semantic Scholar + PubMed search is smart. Those two APIs have pretty different coverage (PubMed is strong on life sciences, Semantic Scholar covers CS and cross-disciplinary better), so hitting both and merging gives you a much more complete related-work picture than either alone.

One thing I’d be curious about: how does it handle papers where the PDF doesn’t have machine-readable text (older scans, some conference proceedings)? The PDF Vector node presumably handles OCR, but wondering if you’ve hit edge cases where metadata extraction fails or comes back incomplete.

Also — the APA citation auto-gen is a genuinely underrated feature here. That alone saves 5 minutes per paper if you’re maintaining a bibliography.

Hey Derek. :blush: yeah the dual-database thing was key. PubMed kills it for life sciences but misses a ton of CS/engineering papers. Semantic Scholar fills those gaps.

For scanned PDFs - you hit the exact issue. PDF Vector does OCR but accuracy drops hard on older scans. Metadata extraction goes from like 98% to maybe 85%. Worst case is when the scan quality is shit or the formatting is weird (two-column layouts sometimes confuse it).

What I’ve found: if the DOI extracts correctly, you can usually backfill the rest from CrossRef or PubMed APIs. But if it can’t read the DOI… then you’re manually fixing stuff anyway.
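The CrossRef backfill can be done with a single HTTP Request or Code node. Rough sketch (the CrossRef REST endpoint is real; the field mapping here is a simplified assumption about which columns you'd want to fill):

```javascript
// Map a CrossRef "message" object onto the sheet's fields.
function mapCrossrefMessage(message) {
  return {
    title: (message.title || [])[0] || "",
    journal: (message["container-title"] || [])[0] || "",
    year:
      message.issued && message.issued["date-parts"]
        ? message.issued["date-parts"][0][0]
        : null,
    authors: (message.author || []).map((a) =>
      [a.given, a.family].filter(Boolean).join(" ")
    ),
  };
}

// Look up a DOI on CrossRef (no API key needed for light use).
async function backfillFromCrossref(doi) {
  const res = await fetch(
    `https://api.crossref.org/works/${encodeURIComponent(doi)}`
  );
  if (!res.ok) return null;
  const body = await res.json();
  return mapCrossrefMessage(body.message);
}
```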

Honestly for really old papers I just skip the automation and enter them manually. Not worth debugging bad scans.

The citation auto-gen is clutch though. That 5 minutes adds up fast when you’re processing 50 papers.

What field are you researching in?