PDF Extractor

Hello, colleagues :slight_smile:
I'm working on a project called PDF Extractor. I need to extract data from large, mixed-content technical PDFs (drawings, tables, images) and insert it into DOC templates. There are so many ways to do this, including dedicated services. What do you think is the best approach right now, in terms of price, quality, and development time? Is it better to extract each page in a loop, or to process the full PDF at once? Perhaps use some community nodes?
For now, my current method is:

  1. Get the page count of the file from pdf.co.

  2. Inside a loop, split out each page with pdf.co.

  3. The first branch converts the page to base64 and sends it to Google OCR.

  4. The second branch converts the page to PNG (pdf.co) and sends it to Gemini 2.5 Flash for classification and extraction.

  5. If tables or handwriting exist, the page goes to a table or handwriting extractor.

  6. Merge the branches, then merge the data from all pages.

  7. Run a final extraction of the specific data with Gemini Flash.

    But I suspect this isn't the best or easiest way.

    What do you think?
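For what it's worth, the routing logic in steps 2–6 can be sketched independently of any particular service. This is a minimal, hypothetical sketch (not the actual n8n workflow): the OCR, classification, and table/handwriting extractor calls are injected as plain callables, so the per-page loop and conditional routing can be tested without hitting pdf.co, Google OCR, or Gemini.

```python
def process_pdf(pages, ocr, classify, extract_table, extract_handwriting):
    """Run each page through OCR and classification, routing to the
    specialised extractors only when the classifier flags them.

    All four service calls are injected callables (hypothetical
    stand-ins for the real API nodes in the workflow).
    """
    results = []
    for page_no, page in enumerate(pages, start=1):
        text = ocr(page)          # branch 1: OCR on the page
        labels = classify(page)   # branch 2: vision-model classification
        extra = {}
        if "table" in labels:
            extra["tables"] = extract_table(page)
        if "handwriting" in labels:
            extra["handwriting"] = extract_handwriting(page)
        # merge both branches for this page into one record
        results.append({"page": page_no, "text": text, **extra})
    return results
```

The point of injecting the callables is that the expensive services can be swapped or mocked per page, which is exactly the "different relevant tools per page" idea.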

  • n8n version 1.106.3
  • Running n8n via (self-hosted)
  • Operating system: Windows 11

Hey @Quantum_Club, hope all is good.

Good, fast, cheap - choose any two :slight_smile:

I've had good experience with Mistral for extracting data from PDFs.


I've also been messing around with PDF to Markdown.
Best and cheap, but slow: self-hosted Docling (very slow without a GPU).
Best and fast: image to LLM (GPT-Nano or Gemini 2.5 Lite).

My current method is a mixture:
I have a UI to remove sections I don't want; those sections get sent to an LLM.
Full pages are sent to Docling.
Then everything is put back together based on page numbers.
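The "put back together based on page numbers" step is basically a sort-and-join over fragments from the two paths. A minimal sketch, assuming each extraction path tags its output with a page number (the fragment shape here is hypothetical, not Docling's actual output format):

```python
def reassemble(llm_sections, docling_pages):
    """Merge Markdown fragments from two extraction paths back into
    document order, using each fragment's page number as the sort key.

    Both inputs are lists of dicts like {"page": int, "markdown": str}
    (an assumed shape for this sketch).
    """
    fragments = llm_sections + docling_pages
    fragments.sort(key=lambda f: f["page"])
    return "\n\n".join(f["markdown"] for f in fragments)
```

Because the sort is stable, fragments from the same page keep their original relative order if both paths ever emit the same page number.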

Experimenting with:
Sending the headers plus the TOC to an LLM to get only the proper Markdown header setup.
Sending only pages with tables/graphics to the LLM.
Converting the normal PDF to Markdown.
Replacing headers via text search for proper Markdown header organization.

And spinning up Docling on a local PC for as-needed access to entire PDFs.
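The header-replacement experiment (TOC → LLM → text search over the converted Markdown) can be sketched as a simple line pass. Assuming the LLM pass produces a mapping from header text to heading level (that mapping shape is an assumption of this sketch):

```python
def apply_heading_levels(markdown_text, toc):
    """Rewrite bare header lines into proper Markdown headings.

    `toc` maps header text -> heading level (e.g. from an LLM pass
    over the table of contents); any line matching a TOC entry gets
    the corresponding number of '#' characters prepended.
    """
    out = []
    for line in markdown_text.splitlines():
        level = toc.get(line.strip())
        if level:
            out.append("#" * level + " " + line.strip())
        else:
            out.append(line)
    return "\n".join(out)
```

Exact string matching is fragile against OCR noise, so in practice a fuzzy or normalized match on the header text would likely be needed.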


Thank you for the detailed answer. I'm going to try some community nodes in the next few days.

Hey brother, I don't know if this is helpful, but I just want to share: instead of passing the image to Gemini, if you have the budget for it, you should look into LlamaParse. It's really good at pulling out text and interpreting graphs and flowcharts, turning them into Mermaid markup. I haven't had any luck with Gemini.

I think nano/lite is cheaper than LlamaParse and just as good. Also, the Llama team hasn't finished their community node yet. If you want to try a service, this one was the cheapest ($0.00125 / page) and had good results with tables and headers; you can test it via web upload of a PDF, and it has an API: Convert Webpage to LLM Ready Text, Extract Tables and Structured Data, Parse and OCR PDFs and Images.
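At that quoted $0.00125/page rate, the batch cost is trivial to estimate. A tiny helper (the price constant comes straight from this thread; everything else is illustrative):

```python
PRICE_PER_PAGE = 0.00125  # USD, the per-page rate quoted above


def batch_cost(pages, price=PRICE_PER_PAGE):
    """Estimated cost in USD of parsing `pages` pages at a flat
    per-page rate, rounded to cents."""
    return round(pages * price, 2)
```

So a 1,000-page batch comes out to about $1.25 before any add-ons like table extraction, which is priced separately.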


Thank you. ParseExtract: if I understand correctly, table extraction costs $0.01, but it still looks like the cheapest API solution. My goal for now, though, is to build smart data extraction from PDFs that automatically classifies the data on every page of a mixed-content document and extracts each page with the most relevant tool (without overpaying and without losing data).
