Extract Data from PDF

Hi,

I’m running n8n on my local machine with Docker. I created a workflow that picks up a PDF file from Google Drive and tries to extract data using the ‘Extract from File’ node. Unfortunately, when I look at the ‘Table’ tab, the text field shows [Empty].

I would appreciate any assistance with this.

Thanks.

I suppose your PDF isn’t plain text but rather an image converted to a PDF file. There is no text layer, so the node returns empty. Try another PDF that you know contains plain text to confirm.
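If you want to double-check the file outside n8n, here is a minimal Python sketch of the idea. It assumes the third-party pypdf package is installed (pip install pypdf); the function names has_text_layer and extract_page_texts are just illustrative, not part of n8n or pypdf. A PDF where every page yields empty text is very likely a scanned image.

```python
from typing import Iterable, List


def has_text_layer(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if the page texts hold at least min_chars non-whitespace characters."""
    # Strip all whitespace so pages containing only spaces/newlines count as empty.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text of each page using pypdf (assumed to be installed)."""
    # Imported here so has_text_layer() stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages without a text layer (e.g. scans) typically yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]


# Example usage (the file name is a placeholder):
#   texts = extract_page_texts("invoice.pdf")
#   print("text layer found" if has_text_layer(texts) else "no text layer, OCR needed")
```

If this reports no text layer, the empty result in n8n is expected and OCR is the way to go.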

If you want, you can install the community OCR node n8n-nodes-tesseractjs, which can detect text in images, and use its output to feed the Extract from File node instead. Or you can do it manually, as there are sites on the internet that turn these files into actual plain-text PDF files.

How can I install the community OCR node n8n-nodes-tesseractjs?

Log in to n8n → Go to Settings → Community Nodes → Press Install → Type n8n-nodes-tesseractjs → Agree to the terms and install. You should see the nodes when you refresh n8n. If you are self-hosting, you have to restart n8n from whatever panel you are using.

Edit: There are a lot of community nodes that you don’t usually find in n8n. Search npm for keywords:n8n-community-node-package and make sure to check them out too.

Hi everyone,

I’m using the Extract from File node in n8n to retrieve content from PDF files.

For most PDF formats, it’s working great and maintains the structure well.
However, for one particular format that I use in the flow, the extracted data comes out jumbled and out of order, meaning the text flow is not preserved properly.

Could you please help me with resolving this issue?
Are there any recommended workarounds or alternative approaches that provide better accuracy and preserve text order?

Please ask this in a new topic so more people have the chance to help you.
