Split large PDF files with the help of AI agents?

Describe the problem/error/question

We have large, scanned PDF files containing many different invoices from many sources. Those invoices look different and have a varying number of pages. The majority, however, are just one or two pages long.

We used to split those files into smaller files containing only one invoice each, relying on QR codes manually stamped on each invoice. But this has proven unreliable.

We are thinking of using n8n and AI agents to split those large files without having to apply QR codes. We believe that with today's AI models, this should be possible.

The invoices need further processing afterwards, but for the time being we are mostly interested in splitting large PDF files containing multiple invoices into smaller files containing only one invoice each.

Where should we start?
Is there already a template?
What approach is best?
What model to choose?

Any pointers are very welcome.
Dan


It looks like your topic is missing some important information. Could you provide the following if applicable?

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

Hi Daniel,

you can utilize the API of an external tool via the HTTP Request node to split PDF files into multiple pages. Stirling-PDF, for example, offers a variety of splitting functions (by sections, chapters, size, count, etc.). You can try out the public online version of Stirling-PDF and its API (stirlingpdf.io) or use your own (self-hosted) private instance of Stirling-PDF if data privacy is a concern.
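
As a minimal sketch of what that call could look like outside n8n (the endpoint path and parameter names below are assumptions; check the Swagger/API docs of your Stirling-PDF instance, as they may differ between versions):

```python
# Sketch: split a PDF at known page numbers via a Stirling-PDF-style REST endpoint.
# The endpoint path and parameter names are assumptions -- verify them against
# your instance's API documentation before relying on this.
import requests

STIRLING_URL = "https://your-stirling-instance.example.com"  # hypothetical host


def split_pdf(pdf_path: str, split_after_pages: list[int]) -> bytes:
    """Ask the service to split the PDF after the given page numbers.

    Returns the raw response body (typically a ZIP of the resulting PDFs).
    """
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{STIRLING_URL}/api/v1/general/split-pages",  # assumed endpoint
            files={"fileInput": f},
            data={"pageNumbers": ",".join(str(p) for p in split_after_pages)},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.content


# Hypothetical usage: split after pages 2 and 5, producing three smaller PDFs.
# zip_bytes = split_pdf("invoices.pdf", [2, 5])
```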

Depending on the desired post-processing, you could convert the individual PDF pages into images in order to feed them into a multimodal LLM.

Have a look at @Jim_Le’s workflow, where he parses bank statement PDFs with Gemini Vision AI to produce markdown documents.

The n8n Default Data Loader node also supports the upload and splitting of PDF files into separate pages, if you want to store and analyse your content in a vector database.

@Derek_Cheung demonstrates this in one of his workflows, where he upserts a large PDF into the Qdrant Vector Store.

Cheers, Ingo

Thanks Ingo,
This is not about splitting, but about knowing where to split.
I know about tools like Stirling-PDF…
In those tools, you must tell them to split the file at a specific place/page.

What I am looking for is an n8n flow that would tell such a tool where to split.

I.e., I do not know where to split. Only by opening the file would I be able to see on which pages I would have to split the document, that is, where a new invoice starts.

That is exactly what I need n8n (and AI?) to solve.

Btw, I know @Jim_Le’s workflow from Max’s n8n Studio YouTube video. That video triggered my request…

Dan


Hi Daniel,

did you by chance watch the latest n8n/Qdrant webinar about classification? Maybe that could be another helpful pointer.

I am just brainstorming here, but following that idea, what if you first uploaded a golden set of representative invoices into a vector store, which could then be utilized for classification (invoice: yes/no) based on similarity?

A scanned PDF file would first be split into single pages. These pages would be fed into the classification sub-workflow. For each positive response, you would collect the corresponding page number. With multiple consecutive page numbers, you would be able to determine a range to extract from the original file.
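
Here is a rough sketch of that last step: collapsing the page numbers that came back with a positive classification into page ranges that a split tool could act on. The function name and plain-Python form are illustrative only; in n8n this logic would likely live in a Code node.

```python
def pages_to_ranges(invoice_pages: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted page numbers into inclusive (first, last) ranges."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(invoice_pages):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)   # extend the current range
        else:
            ranges.append((page, page))          # start a new range
    return ranges


# Example: pages 1-2 and 4-6 each form one invoice, page 8 a single-page invoice.
assert pages_to_ranges([1, 2, 4, 5, 6, 8]) == [(1, 2), (4, 6), (8, 8)]
```

One limitation to keep in mind: two single-page invoices sitting directly next to each other would be merged into one range, so the classification alone may need an additional signal to separate them.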

Ingo

Thanks @Ingo,
I did not know about that webinar. I will certainly have a look into it, although I do not (yet?) understand the concepts of classification and vector databases.
Time for some learning, I guess.

Perhaps others also have some additional ideas?

Hey @DanielH

I would second @Ingo’s suggestion of splitting the PDF into separate pages (images) and using a vision model to identify the start of each invoice.

Though before going down the embeddings route, I’d suggest trying something simpler, like getting the AI to just tag the company name and/or invoice number for each page. That way, when these change from page to page, you’ll know you’ve hit a new invoice.
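
A rough sketch of that boundary logic, assuming the vision model has already returned one tag per page (company name and/or invoice number, `None` where it could not find one). The function name and data shape are made up for illustration:

```python
def boundaries_from_tags(page_tags: list[str | None]) -> list[int]:
    """Return the 1-based page numbers where a new invoice begins."""
    starts: list[int] = []
    previous: str | None = None
    for page_number, tag in enumerate(page_tags, start=1):
        if tag is not None and tag != previous:
            starts.append(page_number)   # tag changed -> new invoice starts here
        if tag is not None:
            previous = tag
    return starts


# Example: pages 1-2 share invoice "ACME-001", page 3 starts "GLOBEX-417".
assert boundaries_from_tags(["ACME-001", "ACME-001", "GLOBEX-417"]) == [1, 3]
```

The resulting start pages can then be handed to whatever splitting tool you use, such as the Stirling-PDF call mentioned earlier.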

Where should we start?
Build a sturdy PDF-to-image conversion pipeline. Stirling-PDF, as mentioned, works wonders here, though I found you do need to give it ample resources to handle a decent load.

Is there already a template?
Nothing specific to your use case, but there are quite a few templates using vision AI, as Ingo mentioned. Here’s a rough idea of what it could look like if you didn’t have to deal with scanned PDFs.

What approach is best?
I think this is a really valid use case for multimodal, vision-capable models. Unlike traditional OCR, you could skip a few steps such as “converting to text” and “parsing regions” and be able to perform logic directly over images, tables, etc.

Cost-wise, at time of writing and dependent on model and provider, it would be incredibly cheap to run.

What model to choose?
You’ll specifically want a multimodal LLM which can read images. These include OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini, and Mistral’s Pixtral.

For me personally, Google’s Gemini (1.5 Pro family) has been the most consistent when it comes to classification and extraction of numerical and tabular data, and I’m super excited to try Gemini 2.0. The pricing is also some of the lowest among the major players.

The only caveat with Gemini is that it has unpredictable built-in safety kill switches, which are known to trigger when it detects PII.


Your problem with splitting large, scanned PDF invoice files is something I’ve recently tackled, and I might have something that could help you get started.

I was dealing with similar issues of needing accurate text extraction from PDFs. I found that even some popular tools had trouble with OCR accuracy, even the paid ones. So I ended up building my own solution: a free, cloud-hosted microservice specifically for PDF OCR. It might be a useful first step in your workflow. It uses pytesseract, and I have found it very efficient: it takes a PDF link and does the OCR beautifully.
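
For illustration, the core idea can be sketched in a few lines with pytesseract and pdf2image (this is not the exact code from the repo; both the Tesseract binary and Poppler need to be installed locally):

```python
# Sketch: fetch a PDF by URL, render its pages to images, and OCR each page.
import requests
import pytesseract
from pdf2image import convert_from_bytes


def ocr_pdf_from_url(url: str) -> list[str]:
    """Download a PDF and return the OCR'd text of each page."""
    pdf_bytes = requests.get(url, timeout=60).content
    pages = convert_from_bytes(pdf_bytes, dpi=300)   # one PIL image per page
    return [pytesseract.image_to_string(page) for page in pages]


# texts = ocr_pdf_from_url("https://example.com/invoices.pdf")
```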

I’ve documented the whole project, including how it works and how to use it, on GitHub: https://github.com/agniiva/pdf-ocr-gcp. It might save you some time and effort on the OCR part, allowing you to then focus on splitting those invoices as per your needs.

While my project doesn’t directly address the invoice splitting aspect using AI, it could provide a solid foundation by giving you clean, extracted text to work with. You could potentially then feed this extracted text into n8n or your AI agents to determine invoice boundaries.

I’d love to hear your thoughts on how you end up solving the splitting part. Good luck with your project, and I hope my OCR service proves useful!


Thanks a lot @Jim_Le ,
Unfortunately, I am lost, even though you have provided a possible flow…
Dan

Thanks @agniiva ,
I will forward your information to my developer.

My developer is very good, but very AWS-serverless oriented and pretty conservative when it comes to AI and LLMs or LCMs.

We already have a traditional solution that uses tesseract in an AWS serverless environment.

Unfortunately, that solution depends on QR codes being stamped on each invoice. Clients tend to forget that stamp, or the stamp is unreadable (badly dried ink, or other reasons).

What I am trying to do here is give that developer a proof of concept. He currently does not believe that AI is capable of reliably splitting large documents into smaller ones.

I want to get rid of the split-page-here QR codes and hoped I could put together something which would convince the developer to go down that path.

Dan

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.