I’m using n8n on my local machine with Docker. I created a workflow that picks up a PDF file from Google Drive and tries to extract its data using ‘Extract from File’. Unfortunately, when I look at the ‘Table’ tab, the text shows as [Empty].
I suspect your PDF isn’t plain text but rather an image converted to a PDF file. There is no text layer, so the node has nothing to extract and returns empty. Try another PDF that contains real, selectable text to confirm.
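If you’d rather check the file itself, here is a rough standalone Python sketch (run outside n8n, standard library only). It relies on an assumption rather than a guarantee: a PDF with a text layer normally references at least one font, while a pure scan usually references none. The function name `looks_like_it_has_text_layer` is made up for this example, and the check is only a heuristic, not a real PDF parser.

```python
import re
import zlib

# Matches the raw body of every PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def looks_like_it_has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the file references a font anywhere.

    Assumption: text in a PDF needs a font resource, so a file with
    no "/Font" entry is most likely image-only (a scan).
    """
    # Fonts listed in uncompressed objects are visible in the raw bytes.
    if b"/Font" in pdf_bytes:
        return True

    # Newer PDFs may hide objects inside Flate-compressed streams,
    # so try to inflate each stream and look again.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in data:
            return True
    return False
```

For a downloaded file you could call it as `looks_like_it_has_text_layer(open("file.pdf", "rb").read())`. A result of `False` suggests the PDF needs OCR before any text can be extracted.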
If you want, you can install the community OCR node n8n-nodes-tesseractjs, which can detect text in images, and use its output to feed the Extract from File node instead. Or you can do it manually, as there are sites on the internet that turn these files into actual plain-text PDF files.
Log in to n8n → Go to Settings → Community Nodes → Press Install → Type n8n-nodes-tesseractjs → Agree to the terms and install.
You should see the nodes when you refresh n8n. If you are self-hosting, you have to restart n8n from whatever panel you are using.
I’m using the Extract from File node in n8n to retrieve content from PDF files.
For most PDF formats, it’s working great and maintains the structure well.
However, for one particular format that I use in the flow, the extracted data comes out jumbled and out of order, meaning the text flow is not preserved properly.
Could you please help me with resolving this issue?
Are there any recommended workarounds or alternative approaches that provide better accuracy and preserve text order?
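One possible workaround, sketched in Python under some loud assumptions: if you can get the text out as positioned fragments (each with its text plus x/y coordinates) from some PDF parser, you can rebuild the reading order yourself by grouping fragments into lines by their vertical position and sorting each line left to right. Whether your extractor exposes coordinates is an assumption here; the `Fragment` class, the `to_reading_order` function, and the `line_tolerance` parameter are all hypothetical names for illustration. This simple version does not handle multi-column layouts.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline, measured from the bottom of the page (PDF convention)


def to_reading_order(fragments, line_tolerance=2.0):
    """Rebuild text top to bottom, left to right, from positioned fragments."""
    # Higher y is closer to the top of the page, so sort by y descending.
    ordered = sorted(fragments, key=lambda f: -f.y)

    lines = []
    for frag in ordered:
        # Same line if the baseline is within tolerance of the line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, order fragments by their left edge.
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )
```

The tolerance value depends on the font size and units of your document, so expect to tune it for the format that comes out jumbled.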