Best way to get OCR data in n8n

I need an OCR tool that can extract all the text from a PDF file. We often receive PDFs that are essentially just scanned images with no text layer, which is why our own “PDF reader” in n8n doesn’t always work. I need something that can handle those files.

We receive large annual reports in PDF format with lots of columns, and we need the data from some of those columns.

For example, in the image, we need to get these columns out as JSON.

I am pretty new, so explain it to me like I’m a child :sweat_smile:

Since it’s an image, you need an OCR node, as you said. n8n does not have an official one, but there is a community node: n8n-nodes-tesseractjs

After extracting the text from the image, you can filter it in different ways. I always use a ‘Code’ node for this, but you can use an AI node too if you don’t mind spending extra tokens.
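To give a rough idea of what that Code node step could look like, here is a minimal sketch in Python (the Code node also runs JavaScript, and the same logic carries over). It assumes the OCR text keeps table columns separated by two or more spaces, which is not guaranteed for every scan. The function name `rows_from_ocr_text`, the `header_keyword` parameter and the sample text are all made up for illustration.

```python
import json
import re


def rows_from_ocr_text(text, header_keyword):
    """Turn whitespace-aligned OCR text into a list of dicts keyed by column header.

    Assumption: columns are separated by 2+ spaces and the header row
    contains `header_keyword` (a hypothetical marker you choose yourself).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Locate the header row; give up quietly if it is not there.
    start = next((i for i, ln in enumerate(lines) if header_keyword in ln), None)
    if start is None:
        return []

    headers = re.split(r"\s{2,}", lines[start])
    rows = []
    for ln in lines[start + 1:]:
        cells = re.split(r"\s{2,}", ln)
        if len(cells) != len(headers):
            continue  # skip lines that do not match the table shape
        rows.append(dict(zip(headers, cells)))
    return rows


# Made-up sample text standing in for real OCR output.
sample = """Annual Report 2023

Item          2023      2022
Revenue       1,200     1,050
Costs         800       760
Notes to the accounts
"""

print(json.dumps(rows_from_ocr_text(sample, "Item"), indent=2))
```

Real OCR output is usually messier than this (merged columns, broken numbers), so treat it as a starting point and check the result against a few real reports.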

In my opinion, Mistral AI’s OCR tool is the best out there right now. Ideally you would call their API from an HTTP Request node to do the OCR. If the data is sensitive, you can get a special license from them to run a private on-prem instance for better privacy.

If you need an example, I can build a demo.
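As a rough sketch of what the HTTP Request node would need to send, the Python below just builds the request and reads a response, without actually calling anything. The endpoint URL, the `mistral-ocr-latest` model name, the `document_url` body shape and the `pages`/`markdown` response fields are assumptions based on my reading of Mistral's public API docs, so please confirm them there before relying on this. The helper names `build_ocr_request` and `pages_to_text` are hypothetical.

```python
# Assumed endpoint; confirm against Mistral's API reference.
MISTRAL_OCR_URL = "https://api.mistral.ai/v1/ocr"


def build_ocr_request(pdf_url, api_key, model="mistral-ocr-latest"):
    """Describe the POST request you would configure in an HTTP Request node.

    Nothing is sent here; this only returns the method, URL, headers and JSON body.
    """
    return {
        "method": "POST",
        "url": MISTRAL_OCR_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": {
            "model": model,  # assumed model name
            "document": {"type": "document_url", "document_url": pdf_url},
        },
    }


def pages_to_text(response):
    """Join the per-page markdown of an OCR response into one string.

    Assumes the response looks like {"pages": [{"markdown": "..."}, ...]}.
    """
    return "\n\n".join(page.get("markdown", "") for page in response.get("pages", []))


# Placeholder values, not real credentials or a real document.
request = build_ocr_request("https://example.com/report.pdf", "YOUR_API_KEY")

# Hand-written stand-in for a response, just to show the shape being assumed.
fake_response = {"pages": [{"markdown": "Page one"}, {"markdown": "Page two"}]}
text = pages_to_text(fake_response)
```

In n8n you would put the URL, headers and JSON body into the HTTP Request node, then pass the returned text on to a Code or AI node to pull out the columns you need.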

If you just need to turn those image‑based PDFs into actual text/structured data (like JSON), try Mathpix: https://mathpix.com/pdf-conversion. It does OCR on scanned PDFs and can extract tables/columns cleanly, then you can grab the data as JSON for n8n. Way less painful than trying to hack it with a basic PDF reader.
