PDF parsing to JSON only extracts the first paragraph

Describe the problem/error/question

I am trying to parse a PDF document into structured JSON output using an LLM. This works well, but it only ever parses the first paragraph of the document.

Example: I am trying to parse PDF Link into the following format:

{
  "title": "I. Compliance and reporting obligations",
  "section": "1. General compliance expectation",
  "paragraph_number": "2.",
  "text": "Guidelines reflect EBA’s view on appropriate supervisory practices in the ESFS and application of Union law."
}

In my prompt I explicitly tell the LLM (I tried Gemini Pro as well as ChatGPT 4o) to create 20 JSON objects for the first 20 paragraphs. The parsing works, but it stops after outputting the first JSON object. What do I have to do in order to get all 20 (or, once that works, all >130 in one go) JSON objects?

If I input this prompt into ChatGPT directly, it works just fine.

What is the error message (if any)?

Please share your workflow

{
  "nodes": [
    {
      "parameters": {
        "operation": "download",
        "fileId": {
          "__rl": true,
          "value": "={{ $json.id }}",
          "mode": "id"
        },
        "options": {}
      },
      "id": "4069217f-22e0-41b7-9b0b-86f3ab2674df",
      "name": "Download PDF 1 from Google Drive",
      "type": "n8n-nodes-base.googleDrive",
      "typeVersion": 3,
      "position": [180, 40],
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "E0r3UzaAAWTYH1dR",
          "name": "Google Drive account"
        }
      }
    },
    {
      "parameters": {
        "resource": "fileFolder",
        "searchMethod": "query",
        "queryString": "not name contains 'outsourcing'",
        "filter": {
          "folderId": {
            "__rl": true,
            "value": "1nFQIIShX8vUdUflNFNxMMs4ePePSJ08w",
            "mode": "list",
            "cachedResultName": "Regulatorikvergleich",
            "cachedResultUrl": "https://drive.google.com/drive/folders/1nFQIIShX8vUdUflNFNxMMs4ePePSJ08w"
          },
          "whatToSearch": "files"
        },
        "options": {}
      },
      "type": "n8n-nodes-base.googleDrive",
      "typeVersion": 3,
      "position": [-20, 40],
      "id": "17ec3382-833f-43e1-a23e-202d512a03f7",
      "name": "Search files and folders1",
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "E0r3UzaAAWTYH1dR",
          "name": "Google Drive account"
        }
      }
    },
    {
      "parameters": {},
      "type": "n8n-nodes-base.manualTrigger",
      "typeVersion": 1,
      "position": [-220, 40],
      "id": "26b00624-75f5-4409-84d4-47ba985b05c2",
      "name": "When clicking ‘Execute workflow’"
    },
    {
      "parameters": {
        "promptType": "define",
        "text": "=Extract data from the attached pdf. Focus on identifying:\n\n\n\n1) Titles with Roman numerals (like ‘Title I - …’),\n\n2) Sections (like ‘1 Proportionality’) and\n\n3) Paragraph numbers (like ‘22.’, ‘23.’, etc.) with their\n\n4) full text content. Preserve the hierarchical structure. Create a JSON for every paragraph in the following structure:\n\n \n\n\"title\": \n\n\"section\":\n\n\"paragraph_number\"\n\n\"text\" \n\n\n\nAn example would be:\n\n{ \n\n\"title\": \"I. Compliance and reporting obligations\", \"section\": \"1. General compliance expectation\", \n\n\"paragraph_number\": \"2.\", \n\n\"text\": \"Guidelines reflect EBA’s view on appropriate supervisory practices in the ESFS and application of Union law.\" \n\n}\n\n\n\nStart with the first 20 paragraphs and create 20 JSON objects",
        "hasOutputParser": true,
        "batching": {}
      },
      "type": "@n8n/n8n-nodes-langchain.chainLlm",
      "typeVersion": 1.7,
      "position": [600, 40],
      "id": "8653a3eb-0a29-43df-8d6e-bc77598bedf4",
      "name": "Basic LLM Chain",
      "retryOnFail": false,
      "onError": "continueErrorOutput"
    },
    {
      "parameters": {
        "modelName": "models/gemini-2.5-pro",
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.lmChatGoogleGemini",
      "typeVersion": 1,
      "position": [600, 260],
      "id": "ac3b3b45-dceb-4970-82ab-bf09ad06276b",
      "name": "Google Gemini Chat Model",
      "credentials": {
        "googlePalmApi": {
          "id": "t262Q1qfR4DsPIrr",
          "name": "Google Gemini(PaLM) Api account"
        }
      }
    },
    {
      "parameters": {
        "jsonSchemaExample": "{ \n\"title\": \"I. Compliance and reporting obligations\", \"section\": \"1. General compliance expectation\", \n\"paragraph_number\": \"2.\", \n\"text\": \"Guidelines reflect EBA’s view on appropriate supervisory practices in the ESFS and application of Union law.\" \n}"
      },
      "type": "@n8n/n8n-nodes-langchain.outputParserStructured",
      "typeVersion": 1.3,
      "position": [760, 260],
      "id": "60dbdbfa-429f-4464-9b64-794bb23d9384",
      "name": "Structured Output Parser"
    }
  ],
  "connections": {
    "Download PDF 1 from Google Drive": {
      "main": [
        [
          {
            "node": "Basic LLM Chain",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Search files and folders1": {
      "main": [
        [
          {
            "node": "Download PDF 1 from Google Drive",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "When clicking ‘Execute workflow’": {
      "main": [
        [
          {
            "node": "Search files and folders1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Basic LLM Chain": {
      "main": []
    },
    "Google Gemini Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "Basic LLM Chain",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Structured Output Parser": {
      "ai_outputParser": [
        [
          {
            "node": "Basic LLM Chain",
            "type": "ai_outputParser",
            "index": 0
          }
        ]
      ]
    }
  },
  "pinData": {},
  "meta": {
    "templateCredsSetupCompleted": true,
    "instanceId": "b3c05fcf9e901b5177ce1eb054ce11551b5ac2dcb5e1d188626664a6b4ccbbbc"
  }
}

(Select the nodes on your canvas and use the keyboard shortcuts CMD+C/CTRL+C and CMD+V/CTRL+V to copy and paste the workflow.)
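One detail worth checking in the workflow above (my observation, not something confirmed in this thread): the Structured Output Parser's `jsonSchemaExample` describes a single object, so the parser has every reason to coerce the model's reply down to exactly one object, regardless of what the prompt asks for. Giving it an array-shaped example should let all 20 paragraphs through. A sketch of what that `jsonSchemaExample` could look like (the `paragraphs` wrapper key is my own choice, not from the original workflow):

```json
{
  "paragraphs": [
    {
      "title": "I. Compliance and reporting obligations",
      "section": "1. General compliance expectation",
      "paragraph_number": "2.",
      "text": "Guidelines reflect EBA's view on appropriate supervisory practices in the ESFS and application of Union law."
    },
    {
      "title": "I. Compliance and reporting obligations",
      "section": "1. General compliance expectation",
      "paragraph_number": "3.",
      "text": "…"
    }
  ]
}
```

With two items in the example, the parser infers an array of objects rather than a single object, and the downstream node receives the full list.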

Share the output returned by the last node

Information on your n8n setup

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

Hi,
n8n already has a node to extract data from a PDF to JSON.
You can use the “Extract from File” node in n8n.

You can start by getting the text out of the PDF.

This way all the text will be in `{{ $json.text }}` as it comes into the next node you add to the workflow, so you are dealing with plain text instead of the PDF file. Next, you could try to feed it to the Agent, or try to chunk it up if the size of the document exceeds the context window.
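A minimal sketch of that extraction step, assuming the standard “Extract from File” node reading the PDF from the default `data` binary property (the node name here is a placeholder, and `id`/`position` fields are omitted):

```json
{
  "parameters": {
    "operation": "pdf",
    "binaryPropertyName": "data",
    "options": {}
  },
  "type": "n8n-nodes-base.extractFromFile",
  "typeVersion": 1,
  "name": "Extract text from PDF"
}
```

Wire it between the Google Drive download node and the LLM chain; its output item carries the extracted text in the `text` field.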

Cheers.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.