Comparing two PDFs in n8n without hitting OpenAI rate limits

Hey all,
I’m building an n8n flow where I want to compare two PDF documents (think insurance offers). Right now my flow looks like this:

1. Take two PDF documents
2. Run them through Google Cloud OCR → get JSON
3. Feed both JSON files into an OpenAI agent for semantic comparison
The issue: the OCR output is huge. When I pass the JSON into OpenAI, I keep hitting rate limits and context-length limits.
Has anyone found a good strategy for:

- Reducing / compressing OCR output before sending it to OpenAI?
- Splitting the data into chunks in a way that still allows a meaningful “document vs. document” comparison?
- Alternative tools for structured extraction from PDFs (instead of raw OCR → giant text blob)?
I’m currently using Google Cloud OCR (cheap + scalable), but I’m open to switching if there’s a better option.
Any tips, best practices, or examples of similar flows would be super appreciated!

Hey @Quantumfg — raw OCR JSON is far too heavy for OpenAI; most of it is bounding boxes and per-word metadata you don’t need for a semantic comparison. In n8n, strip it down to plain text in a Function node, then use SplitInBatches to feed it along in ~2k-token chunks. That keeps you under the limits, and chunk-by-chunk comparisons still give solid doc-vs-doc results.
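Here’s a minimal sketch of what that Function-node logic could look like. It assumes a Google Cloud Vision-style OCR response where each entry in `responses[]` carries `fullTextAnnotation.text` (adjust the path to match your actual OCR output), and it uses the rough ~4-characters-per-token heuristic for English text; the helper names are illustrative, not n8n built-ins.

```javascript
// Rough heuristic: ~4 characters per token for English text.
const CHARS_PER_TOKEN = 4;
const MAX_TOKENS_PER_CHUNK = 2000;

// Pull plain text out of the OCR JSON, dropping bounding boxes,
// confidence scores, and the rest of the per-word metadata.
function extractText(ocrJson) {
  const responses = ocrJson.responses || [];
  return responses
    .map(r => (r.fullTextAnnotation && r.fullTextAnnotation.text) || '')
    .join('\n')
    .replace(/\s+/g, ' ') // collapse whitespace to save tokens
    .trim();
}

// Split text into ~2k-token chunks, breaking on a sentence
// boundary (". ") where one falls inside the window.
function chunkText(text, maxTokens = MAX_TOKENS_PER_CHUNK) {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const lastPeriod = text.lastIndexOf('. ', end);
      if (lastPeriod > start) end = lastPeriod + 1;
    }
    chunks.push(text.slice(start, end).trim());
    start = end;
  }
  return chunks;
}

// In an n8n Function node you would return one item per chunk, e.g.:
// return chunkText(extractText(items[0].json))
//   .map(c => ({ json: { chunk: c } }));
```

One item per chunk lets SplitInBatches (or Loop Over Items) pace the OpenAI calls, which also helps with the per-minute rate limits, not just the context window.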

Kindly mark this as the solution if it helps.

