Extracting requirements from large PDF files

Can someone help me out?

I am looking for an efficient method to extract all building requirements from a legal contract. This appears to be possible using ChatGPT-4. The challenge, however, is that these contracts vary greatly in length, from around 50 to as many as 1500 pages, and would first need to be divided into smaller documents of no more than three pages each (since ChatGPT-4o can’t handle more in a single message).

My idea is to create a node in n8n that automatically splits a PDF file into smaller PDF documents of no more than three pages each. Each of these documents would then be sent to the AI automatically (doing this by hand would take far too long for a 1500-page document), and the AI would extract the requirements from one document at a time. These requirements would be stored in an Excel file, with the process repeating for each PDF and all extracted requirements consolidated into the same Excel file.
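The splitting step described above comes down to simple page arithmetic. Here is a minimal sketch of it in Python; the function name `page_ranges` and its parameters are hypothetical, and the actual cutting of the PDF (which needs a PDF library or an external tool) is not shown.

```python
def page_ranges(total_pages, max_pages=3):
    """Return inclusive, 1-based (start, end) page ranges covering a document.

    Each range spans at most `max_pages` pages; the last one may be shorter.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 7-page contract becomes three parts; the last part holds a single page.
print(page_ranges(7))
# A 1500-page contract split into 3-page parts.
print(len(page_ranges(1500)))
```

At three pages per part, a 1500-page contract produces 500 parts, so the workflow has to loop over them without manual steps.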

I apologize for my limited knowledge of n8n, but I hope someone can confirm whether this concept is technically feasible using n8n, and perhaps provide some guidance on how best to implement this process.

Kind regards,


Hi @Jop_Van_de_Wiel,

Here’s a workflow to help you get started. You can extract the file to text, then chunk it with a Code node. Run each chunk through an agent to extract all relevant information. From there, you can store it in something like Google Sheets or keep analyzing it in n8n.
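To illustrate the chunking step, here is a minimal sketch in Python. It is not the code from the workflow above: the function name `chunk_text` and the `max_chars` and `overlap` defaults are assumptions chosen for illustration, and the overlap is there only so that a requirement sitting on a chunk boundary appears whole in at least one chunk.

```python
def chunk_text(text, max_chars=6000, overlap=200):
    """Split text into chunks of at most `max_chars` characters.

    Consecutive chunks share `overlap` characters. Both defaults are
    illustrative assumptions, not values taken from the thread.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Small values make the overlap visible: each chunk repeats one character.
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```

Splitting on characters is the simplest option; splitting on paragraph or clause boundaries would probably keep requirements intact more often.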

Best,

Robert Breen

Hi Robert,

Thank you for your response.

In the meantime, I’ve been experimenting with building my own workflows. I came to the conclusion that splitting a PDF file into smaller parts can be done efficiently using external tools, so I decided not to build that part into n8n.

Using ChatGPT as support, I created a workflow that processes uploaded PDF documents and extracts relevant requirements. However, I keep running into the following error:

Problem in node 'Extract Text': The item has no binary field 'data' [item 0]

I get the exact same error when using your workflow. According to ChatGPT, I need to reference the correct field key instead of using the default binary property name, but I can’t seem to find the field key setting anywhere in the form configuration.
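The error message suggests that the uploaded file arrives under a binary property with a name other than `data`. As a rough sketch, assuming the usual n8n item shape with a `json` part and a `binary` part whose entries carry a `mimeType`, the following Python function lists the binary property names that are actually present. The function name `binary_keys` and the sample key `Upload_PDFs` are hypothetical and only serve the illustration.

```python
def binary_keys(item, mime_type="application/pdf"):
    """Return the names of binary properties on an item that match `mime_type`."""
    binary = item.get("binary") or {}
    return [
        key
        for key, meta in binary.items()
        if meta.get("mimeType") == mime_type
    ]


# Hypothetical item: the property name "Upload_PDFs" is an assumption.
item = {
    "json": {},
    "binary": {
        "Upload_PDFs": {"fileName": "contract.pdf", "mimeType": "application/pdf"},
    },
}
print(binary_keys(item))
```

Whatever name shows up in the binary part of the trigger's output is the one the Extract Text node would need in place of `data`.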

I feel like I’m very close to getting it working, but I could really use some guidance on this part. Any help would be greatly appreciated.

Best regards,

Jop

{
  "name": "Extract Requirements from PDFs",
  "nodes": [
    {
      "id": "formTrigger",
      "name": "Upload PDF(s)",
      "type": "n8n-nodes-base.formTrigger",
      "typeVersion": 2,
      "position": [0, 0],
      "parameters": {
        "formTitle": "Upload Multiple PDFs",
        "formFields": {
          "values": [
            {
              "fieldLabel": "Upload PDFs",
              "fieldType": "file",
              "fieldKey": "pdfUploads",
              "options": {
                "multiple": true
              }
            }
          ]
        }
      },
      "webhookId": "upload-multiple-pdfs"
    },
    {
      "id": "loopFiles",
      "name": "Loop over files",
      "type": "n8n-nodes-base.splitInBatches",
      "typeVersion": 1,
      "position": [250, 0],
      "parameters": {
        "batchSize": 1
      }
    },
    {
      "id": "extractText",
      "name": "Extract Text",
      "type": "n8n-nodes-base.extractFromFile",
      "typeVersion": 1,
      "position": [500, 0],
      "parameters": {
        "operation": "pdf",
        "binaryPropertyName": "data"
      }
    },
    {
      "id": "sendToAI",
      "name": "Extract Requirements (AI)",
      "type": "@n8n/n8n-nodes-langchain.agent",
      "typeVersion": 1.8,
      "position": [750, 0],
      "parameters": {
        "promptType": "define",
        "text": "={{ $json.text }}",
        "options": {
          "systemMessage": "You are a construction assistant. Extract all technical or functional requirements from this document. For each requirement, return: subject, description, paragraph reference (if any), and STABU code (if relevant). Output in JSON format."
        }
      }
    },
    {
      "id": "collect",
      "name": "Collect Results",
      "type": "n8n-nodes-base.merge",
      "typeVersion": 1,
      "position": [950, 0],
      "parameters": {
        "mode": "append"
      }
    },
    {
      "id": "toExcel",
      "name": "Export to Excel",
      "type": "n8n-nodes-base.spreadsheetFile",
      "typeVersion": 1,
      "position": [1150, 0],
      "parameters": {
        "operation": "writeToFile",
        "fileFormat": "xlsx",
        "dataPropertyName": "data",
        "options": {
          "includeEmptyRows": false
        }
      }
    }
  ],
  "connections": {
    "Upload PDF(s)": {
      "main": [[{"node": "Loop over files", "type": "main", "index": 0}]]
    },
    "Loop over files": {
      "main": [[{"node": "Extract Text", "type": "main", "index": 0}]]
    },
    "Extract Text": {
      "main": [[{"node": "Extract Requirements (AI)", "type": "main", "index": 0}]]
    },
    "Extract Requirements (AI)": {
      "main": [[{"node": "Collect Results", "type": "main", "index": 0}]]
    },
    "Collect Results": {
      "main": [[{"node": "Export to Excel", "type": "main", "index": 0}]]
    }
  }
}
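The system message in this workflow asks the model for JSON output, so consolidating the results means parsing each reply and appending its entries to one table. Below is a minimal sketch of that step in Python, writing CSV (which Excel can open) with the standard library only. The field names in `FIELDS` and the function names are assumptions: the prompt does not fix the exact keys the model will return.

```python
import csv
import io
import json

# Assumed key names; the prompt only describes the fields in words.
FIELDS = ["subject", "description", "paragraph_reference", "stabu_code"]


def collect_rows(model_outputs):
    """Parse one JSON reply per chunk and flatten them into a list of rows."""
    rows = []
    for raw in model_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # reply was not valid JSON; a real workflow should log this
        if isinstance(parsed, dict):
            parsed = [parsed]  # a single requirement returned as one object
        if not isinstance(parsed, list):
            continue
        for entry in parsed:
            if isinstance(entry, dict):
                rows.append({field: entry.get(field, "") for field in FIELDS})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up replies, one per chunk, including one that is not valid JSON.
outputs = [
    '[{"subject": "Fire door", "description": "Must be rated for 60 minutes"}]',
    "Sorry, I could not find any requirements.",
    '{"subject": "Roof", "description": "Must be insulated"}',
]
print(rows_to_csv(collect_rows(outputs)))
```

Skipping replies that fail to parse keeps the run going, but on a 1500-page contract it would be worth recording which chunks were skipped so they can be retried.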


Hi @Jop_Van_de_Wiel,

I’ll look into that.

Can you paste that workflow into a code block so it renders and I can copy it?

May I ask how I can share my workflow using a code block?


Copy your workflow from n8n and paste it into a code block here. Here’s an image.

Thanks for the clarification. I think something is wrong with my account, since I don’t have the same options as you.

Personally, I would approach this problem by implementing a RAG (retrieval-augmented generation) system in n8n. That way the document is chunked (split into pieces) and becomes searchable using natural language, so you can search for the specific parts of the document you need. With this semantic search you can pinpoint the requirements you want to extract instead of going through the document page by page. I might be misunderstanding your requirement, but that’s my two cents. Let me know if you need more info, or we could discuss your requirement in more detail.
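To make the retrieval idea concrete, here is a toy sketch in Python that ranks chunks against a query. It deliberately swaps in word-count cosine similarity for the embedding model and vector store a real RAG setup would use, so it only matches shared words, not meaning. All function names and the sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_chunks(query, chunks, k=3):
    """Return up to `k` chunks that share words with the query, best first."""
    query_vector = Counter(tokenize(query))
    scored = [(cosine(query_vector, Counter(tokenize(c))), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matching chunks
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Made-up contract fragments for illustration only.
chunks = [
    "The roof must be insulated.",
    "Payment is due within 30 days of invoice.",
    "All fire doors must be rated for 60 minutes.",
]
print(top_chunks("fire door rating requirements", chunks, k=1))
```

Only the chunks returned by the search would be sent to the model, which is what keeps the number of AI calls low compared with processing every page.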

