Sizing a PDF to iteratively fit into the context window of an agent node

n8n version: 1.122.5

I’m using the Extract from PDF node to feed a document into an agent. For the agent to process the whole document, I need to break the PDF into segments (my plan was to use the table of contents to define them). I’ve successfully pulled out the table of contents and built a list of page ranges, but I’m still stuck: the PDF-to-text node gives me one giant text string (with a 100+ page PDF you can imagine how big) with no page boundaries. Is there a way to have the node return JSON with the PDF’s text broken into separate sections by page number? That way I could feed in specific ranges that fit the model’s context window. Alternatively, is there another node I could use to revisit the downloaded document and pull out just the pages I want? Any help would be greatly appreciated.
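For the "list of page ranges" step, something like the sketch below could run in an n8n Code node. The `tocEntries` shape and field names are assumptions — adapt them to whatever your TOC-extraction step actually outputs:

```javascript
// Turn extracted table-of-contents entries into page ranges, so each
// section can later be pulled out and fed to the agent separately.
// Each entry: the section title and the page it starts on (assumed shape).
function segmentsFromToc(tocEntries, totalPages) {
  // Sort by start page so the ranges come out in document order.
  const sorted = [...tocEntries].sort((a, b) => a.startPage - b.startPage);
  return sorted.map((entry, i) => ({
    title: entry.title,
    start: entry.startPage,
    // A section ends one page before the next one starts,
    // or at the last page of the document for the final section.
    end: i + 1 < sorted.length ? sorted[i + 1].startPage - 1 : totalPages,
  }));
}

const toc = [
  { title: 'Introduction', startPage: 1 },
  { title: 'Definitions', startPage: 12 },
  { title: 'Requirements', startPage: 40 },
];
const segments = segmentsFromToc(toc, 120);
console.log(segments);
```

Each segment's `start`/`end` pair can then drive a downstream PDF-splitting step; this doesn't solve the extraction itself, only the bookkeeping.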

What is your ultimate goal here? And where does the PDF come from?
Is it always the same or a changing document?

One approach is to use RAG with a vector store and process the whole PDF this way. You can read about the general concept here: RAG in n8n | n8n Docs
And here you’ll find an example workflow: Ask questions about a PDF using AI | n8n workflow template

If you just want to split a PDF before doing something with it, I’d recommend an API for this kind of thing. There are many available; a free, self-hosted one I use is Stirling-PDF: API | Stirling-PDF
Just use the HTTP Node to send the PDF to the API and receive split documents back.
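A rough sketch of what that HTTP Request node call could look like. The endpoint path and field names below are assumptions — check the Swagger UI of your own Stirling-PDF instance before relying on them:

```javascript
// Build the request an n8n HTTP Request node would send to Stirling-PDF:
// multipart/form-data with the PDF binary and a page list to split on.
// Endpoint and parameter names are assumed, not verified against the API.
function buildSplitRequest(baseUrl, pageNumbers) {
  return {
    method: 'POST',
    url: `${baseUrl}/api/v1/general/split-pages`, // assumed endpoint
    // In the HTTP Request node: set Body Content Type to Form-Data,
    // attach the binary property as "fileInput", plus this text field:
    fields: {
      pageNumbers: pageNumbers.join(','), // pages to split after
    },
  };
}

// Example: split a 120-page PDF after pages 30, 60 and 90,
// yielding four smaller documents to process one at a time.
const req = buildSplitRequest('http://localhost:8080', [30, 60, 90]);
console.log(req.url, req.fields.pageNumbers);
```

The response would then come back as one or more binary items that the rest of the workflow can iterate over.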

The ultimate goal is to take regulatory documentation from different countries and build a vector store from it. Traditional chunking is ineffective here because it just selects arbitrary blocks of text. My workflow is designed to make more intelligent chunks and to add metadata describing each block’s content, intent and relevance to different personas, for prioritisation during retrieval. This should make the vector store more effective to search and produce more relevant output. In testing, I’ve found that LLMs using a vector store chunked this way (with smaller PDFs) perform a lot better than ones built from arbitrary text chunks.

I’ll look into the StirlingPDF HTTP node approach to see if it helps. Ideally I can still run the chunking workflow for metadata once the large PDFs are broken into separate segments based on, say, the table of contents.
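For concreteness, one way the "intelligent chunk plus metadata" could look before insertion into the vector store. Every field name here is illustrative, not an n8n or vector-store schema:

```javascript
// Hypothetical shape of one enriched chunk: the text itself plus
// metadata describing where it came from, its intent, and its audience.
function makeChunk(text, meta) {
  return {
    pageContent: text,
    metadata: {
      section: meta.section,   // TOC heading the chunk came from
      pages: meta.pages,       // e.g. "41-41"
      intent: meta.intent,     // e.g. "obligation", "definition"
      personas: meta.personas, // who the info is relevant to
      priority: meta.priority, // relevance ranking for retrieval
    },
  };
}

const chunk = makeChunk('Operators must report incidents within 72 hours.', {
  section: 'Reporting Requirements',
  pages: '41-41',
  intent: 'obligation',
  personas: ['compliance officer'],
  priority: 1,
});
```

Attaching metadata like this lets retrieval filter or re-rank by persona and intent instead of relying on embedding similarity alone.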

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.