Clinical trial documents are some of the most information-dense PDFs out there. A single protocol document can be 80-150 pages. Extracting the structured data — study phase, primary endpoint definition, sample size targets vs actual enrollment, adverse event counts, dosing schedules by arm, inclusion/exclusion criteria — and getting it into a consistent format for comparison across studies takes hours of careful reading per document.
For CROs, pharma regulatory teams, and research coordinators processing multiple trials simultaneously, that manual extraction time compounds fast.
Built a workflow that reads every clinical trial document the moment it lands in Drive and extracts all structured data automatically.
What it does
Trial PDF dropped in Drive → extracts full study data → formats AE counts and endpoint details → logs to trials database → posts structured summary to Slack
About 12-15 seconds per document.
What gets extracted
Study identification:
- Study ID, protocol number
- Phase (I / II / III / IV)
- Sponsor name
- Indication / disease area
- Regulatory status
Study design:
- Design type (randomized, double-blind, placebo-controlled, etc.)
- Study duration
Endpoints:
- Primary endpoint: name, definition, measurement method
- Secondary endpoints: each with name and definition
- Secondary endpoint count
Enrollment:
- Target sample size
- Enrolled count
- Completed count
Dosing:
- Each arm: dose, frequency, route of administration
Safety:
- Total adverse events
- Serious adverse events (SAEs)
- Most common adverse events (as a list)
Eligibility:
- Inclusion criteria (as array)
- Exclusion criteria (as array)
Results:
- Efficacy results, if available
- Statistical methods used
- Key findings (as numbered list)
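Roughly, the extracted payload looks like this once PDF Vector returns it. This is a sketch only: the field names below are illustrative assumptions, not the node's exact output schema (values are borrowed from the Slack example in the next section), so check one real run in n8n before wiring downstream nodes.

```javascript
// Illustrative shape of one extracted document. Field names are assumptions,
// not PDF Vector's exact schema; values come from the sample Slack message.
const trial = {
  studyId: "NCT04823182",
  protocolNumber: "XYZ-2024-001",
  phase: "Phase III",
  sponsor: "Novartis AG",
  indication: "Relapsed/Refractory Multiple Myeloma",
  design: "Randomized, double-blind, placebo-controlled",
  durationMonths: 24,
  primaryEndpoint: {
    name: "Progression-Free Survival (PFS)",
    definition: "Time from randomization to disease progression or death from any cause",
    measurementMethod: "RECIST v1.1", // hypothetical
  },
  secondaryEndpointCount: 6,
  enrollment: { target: 450, enrolled: 447, completed: 389 },
  arms: [{ arm: "Treatment", dose: "10 mg", frequency: "Once daily", route: "Oral" }], // hypothetical
  safety: {
    totalAEs: 1243,
    seriousAEs: 89,
    commonAEs: ["Fatigue", "Nausea", "Peripheral neuropathy", "Thrombocytopenia"],
  },
  inclusionCriteria: ["..."],
  exclusionCriteria: ["..."],
};
```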
What lands in Slack
🧪 Clinical Trial Document Processed
Study: NCT04823182
Protocol: XYZ-2024-001
Phase: Phase III
Sponsor: Novartis AG
📋 Study Overview:
• Indication: Relapsed/Refractory Multiple Myeloma
• Design: Randomized, double-blind, placebo-controlled
• Duration: 24 months
👥 Sample Size:
• Target: 450
• Enrolled: 447
• Completed: 389
🎯 Primary Endpoint:
Progression-Free Survival (PFS): time from randomization
to disease progression or death from any cause
📊 Secondary Endpoints: 6
⚠️ Safety:
• Total AEs: 1,243
• Serious AEs: 89
• Common: Fatigue, Nausea, Peripheral neuropathy,
Thrombocytopenia
🔗 View Document
What lands in Google Sheets
Each row: Study ID, Protocol, Phase, Sponsor, Indication, Design, Primary Endpoint, Secondary Endpoints (count), Sample Size, Duration, Total AEs, SAEs, Status, Processed Date
Filter by Phase to compare all Phase III trials. Sort by SAEs to flag safety signals. Filter by Indication to pull all trials in a specific disease area.
Setup
You’ll need:
- Google Drive (folder for trial PDFs)
- Google Sheets (free)
- n8n instance (self-hosted; uses the PDF Vector community node)
- PDF Vector account (free tier: 100 credits/month)
- Slack (for team notifications)
About 15 minutes to configure. Must run on self-hosted n8n — clinical documents contain sensitive study data that shouldn’t pass through shared infrastructure.
Download
Workflow JSON:
Full workflow collection:
Setup Guide
Step 1: Get your PDF Vector API key
Sign up at pdfvector.com — free plan works for testing.
Step 2: Create Drive folder and Sheet
Folder: “Clinical Trials” — copy folder ID.
Sheet headers:
Study ID | Protocol | Phase | Sponsor | Indication | Design | Primary Endpoint | Secondary Endpoints | Sample Size | Duration | Total AEs | SAEs | Status | Processed Date
Step 3: Import and configure
Download JSON → n8n → Import from File.
New Trial Document (Drive Trigger):
- Connect Google Drive (OAuth2), paste folder ID
Extract Trial Data (PDF Vector):
- Add PDF Vector credential (Bearer Token), paste API key
Log to Sheets:
- Connect Google Sheets, paste Sheet ID
Send to Slack:
- Connect Slack, select your clinical team channel
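If you want to adjust the Slack layout, a Code node between extraction and Slack can assemble the message text. A minimal sketch, assuming the field names from the payload sketch earlier (map them to whatever your extraction actually returns):

```javascript
// Code node, mode "Run Once for Each Item": build the Slack message text.
// Field names are assumptions; adjust to your PDF Vector output.
const t = $json;
const message = [
  "🧪 Clinical Trial Document Processed",
  `Study: ${t.studyId}   Protocol: ${t.protocolNumber}`,
  `Phase: ${t.phase}   Sponsor: ${t.sponsor}`,
  "",
  `👥 Sample Size: target ${t.enrollment.target}, enrolled ${t.enrollment.enrolled}, completed ${t.enrollment.completed}`,
  `🎯 Primary Endpoint: ${t.primaryEndpoint.name}`,
  `⚠️ Safety: ${t.safety.totalAEs.toLocaleString()} AEs, ${t.safety.seriousAEs} SAEs`,
].join("\n");
return { json: { ...t, slackMessage: message } };
```

Then point the Slack node's message field at `{{ $json.slackMessage }}`.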
Accuracy
Tested on Phase II/III protocol synopses, clinical study reports (CSRs), and published trial PDFs from ClinicalTrials.gov.
- Study ID, phase, sponsor, indication: ~97%
- Primary endpoint name and definition: ~94%
- Sample size (target): ~96%; prominently stated in most protocols
- Actual enrollment vs. completed counts: ~89%; only present in completed study reports
- Total AEs and SAEs: ~92%; reliable in structured safety tables
- Common AE list: ~88%; depends on table formatting in the source document
- Inclusion/exclusion criteria: ~91% on standard protocol formats
- Key findings: ~86%; best on completed CSRs with explicit results sections
Accuracy drops significantly on scanned or OCR-poor PDFs. Digital PDFs from sponsors or from ClinicalTrials.gov work best.
Cost
3-4 credits per document. The free tier's 100 credits cover roughly 25-33 trial documents per month.
Customizing it
Safety signal flagging:
Add an IF node — if SAEs exceed a threshold (e.g., 10% of total AEs), route to a separate urgent Slack channel for immediate team review.
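The condition itself is a one-line Boolean expression on the IF node. Assuming the safety field names from the payload sketch above:

```
{{ $json.safety.seriousAEs > 0.10 * $json.safety.totalAEs }}
```

Route the true branch to the urgent channel and the false branch to the normal notification path.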
Cross-trial comparison:
Since each trial gets the same column structure, add a Sheets formula to compare primary endpoints or SAE rates across all Phase III trials in a specific indication.
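For example, with the column layout from Step 2 (Phase in column C, Indication in E, Primary Endpoint in G, Total AEs in K, SAEs in L), a QUERY can pull endpoints and a computed SAE rate for one phase and indication. A sketch; adjust column letters and filter strings to your sheet:

```
=QUERY(A:N,
  "select A, G, K, L, L/K
   where C = 'Phase III' and E contains 'Myeloma'", 1)
```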
ClinicalTrials.gov integration:
Add an HTTP Request node to pull additional metadata from the ClinicalTrials.gov API using the extracted Study ID (NCT number) — gets you current status, results posting date, and linked publications automatically.
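A sketch, assuming the v2 API's current response shape (verify field paths against the ClinicalTrials.gov API docs): point the HTTP Request node at `https://clinicaltrials.gov/api/v2/studies/{{ $json.studyId }}`, then pull the fields you need in a Code node:

```javascript
// Code node, mode "Run Once for Each Item", after the HTTP Request.
// Field paths follow the ClinicalTrials.gov v2 response as I understand it;
// verify against the current API docs before relying on them.
const statusModule = $json.protocolSection?.statusModule ?? {};
return {
  json: {
    nctId: $json.protocolSection?.identificationModule?.nctId,
    overallStatus: statusModule.overallStatus,
    resultsFirstPosted: statusModule.resultsFirstPostDateStruct?.date,
  },
};
```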
Regulatory submission tracking:
Add a Status column that gets manually updated (In Review, Submitted, Approved) and build a companion workflow that posts a weekly digest of all trials by regulatory status.
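A minimal sketch of the digest step, assuming the companion workflow runs on a Schedule Trigger, reads all rows from the Sheet, and passes them into a Code node:

```javascript
// Code node, mode "Run Once for All Items": group rows by Status and
// build a weekly digest string. Column names match the Step 2 headers.
const groups = {};
for (const { json: row } of $input.all()) {
  const status = row["Status"] || "Unspecified";
  (groups[status] ??= []).push(row["Study ID"]);
}
const digest = Object.entries(groups)
  .map(([status, ids]) => `*${status}* (${ids.length}): ${ids.join(", ")}`)
  .join("\n");
return [{ json: { digest } }];
```

Send `{{ $json.digest }}` to Slack on the schedule.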
Important note
Clinical trial documents often contain proprietary sponsor data, patient enrollment information, and unreported efficacy data. This workflow must run on a secured, self-hosted n8n instance with appropriate access controls. Do not use on shared or cloud-hosted n8n instances for sensitive trial data.
Questions? Drop a comment.
