Extract from File -> Extract from PDF

Sandeep · October 17, 2024, 3:27pm

Hello Team,

I am exploring n8n and I have a PDF file that contains an image (with textual information inside the image) plus normal text.

When I use the “Extract from File” PDF operation, the output JSON contains the normal text from the PDF file, but the text inside the image is skipped. That is expected, since the PDF operation reads only the text information.

Since my PDF has pages with an image (with text) plus regular text, and pages with regular text only, is it possible for the node to give a notification for the pages that contain an image (with text), so that I can use some kind of OCR API to extract the text from those images?

Can you please suggest an n8n workflow that extracts both the text inside images and the regular text? Thank you for your support.

n8n · October 17, 2024, 3:27pm

It looks like your topic is missing some important information. Could you provide the following if applicable.

n8n version:
Database (default: SQLite):
n8n EXECUTIONS_PROCESS setting (default: own, main):
Running n8n via (Docker, npm, n8n cloud, desktop app):
Operating system:

Simon_Coton · October 17, 2024, 3:56pm

hey Sandeep,

I’d suggest something along the lines of what you already alluded to:

Return the text from the PDF
Submit the same PDF to an OCR API of some sort to return the text inside the images
Merge the results, e.g. by PDF name (so you get all the info together)
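
The merge step above could look roughly like the sketch below. It is plain Python and only illustrates the idea; the field names `file_name`, `text`, and `ocr_text`, the function `merge_by_pdf_name`, and the sample data are all hypothetical and do not come from n8n or from any specific OCR API.

```python
from collections import defaultdict


def merge_by_pdf_name(text_results, ocr_results):
    """Combine embedded-text and OCR output, keyed by PDF file name.

    Both inputs are lists of dicts with hypothetical keys
    "file_name" and "text".
    """
    merged = defaultdict(lambda: {"text": "", "ocr_text": ""})
    for item in text_results:
        merged[item["file_name"]]["text"] = item["text"]
    for item in ocr_results:
        merged[item["file_name"]]["ocr_text"] = item["text"]
    return dict(merged)


# Made-up sample data standing in for the two extraction branches.
text_results = [{"file_name": "invoice.pdf", "text": "Regular text"}]
ocr_results = [{"file_name": "invoice.pdf", "text": "Text inside image"}]

merged = merge_by_pdf_name(text_results, ocr_results)
print(merged["invoice.pdf"])
```

Keying on the file name keeps both pieces of text together for each PDF, even if the two branches finish in a different order.
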

As for notifying you when an image is present: what does the “Extract from File” node return when there’s an image? Does it tell you there’s an image, or does it return nothing? If it returns nothing, you might be able to add a step that calls an AI node, uploads the PDF, and has the AI node report which pages contain images.

Sandeep · October 17, 2024, 5:04pm

Hello Simon,

Thank you for the quick response. Regarding your question, the “Extract from File” node does not return anything for the image. If a PDF page has an image (with text inside) plus regular text, it returns only the regular text. So, as per your suggestion, the workflow would be:

Step 1 → Upload the PDF to an AI node that reports which pages contain images
Step 2 → Submit only those PDF pages to an OCR API that returns the text inside the images
Step 3 → Submit the other PDF pages to the existing “Extract from File” PDF operation to return their text
Step 4 → Merge the results from Step 2 and Step 3.
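
As a rough sketch of how Steps 1 to 4 fit together, the routing and merging logic might look like this in plain Python. Everything here is an assumption made for illustration: the `has_image` flag stands in for whatever the AI node reports, the `page` and `text` keys are hypothetical, and the OCR output for a mixed page is assumed to include that page's regular text as well.

```python
def route_pages(pages):
    """Split page records into pages that need OCR and pages that do not.

    Each record is a dict with hypothetical keys "page" and "has_image".
    """
    ocr_pages = [p for p in pages if p["has_image"]]
    text_pages = [p for p in pages if not p["has_image"]]
    return ocr_pages, text_pages


def merge_pages(text_results, ocr_results):
    """Join per-page text from both branches back into page order."""
    combined = sorted(text_results + ocr_results, key=lambda r: r["page"])
    return "\n".join(r["text"] for r in combined)


# Step 1: made-up output of the image-detection step.
pages = [
    {"page": 1, "has_image": False},
    {"page": 2, "has_image": True},
    {"page": 3, "has_image": False},
]
ocr_pages, text_pages = route_pages(pages)

# Steps 2 and 3: made-up results from the OCR and text-extraction branches.
ocr_results = [{"page": 2, "text": "Text read from image"}]
text_results = [
    {"page": 1, "text": "Page one text"},
    {"page": 3, "text": "Page three text"},
]

# Step 4: merge both branches in page order.
full_text = merge_pages(text_results, ocr_results)
print(full_text)
```

If the OCR service only returns the text inside the image, the mixed pages would need to go through both branches, and the two results for a page would be concatenated instead.
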

Would it be possible to split the PDF by page(s), so that only the PDF pages with images are sent to the OCR API?

Thank you

system · January 15, 2025, 5:04pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.