Hello, colleagues
I am working on a project called PDF Extractor. I need to extract data from large technical PDFs with mixed content (drawings, tables, images) and put it into DOC templates. But there are so many ways to do it, or I could use specialized services. What do you think is the best approach right now, in terms of price, quality, and development time? Is it better to extract each page in a loop or process the full PDF at once? Maybe use some community nodes?
For now my current method is:
I've also been messing around with PDF-to-markdown conversion.
Best and cheap, but slow: self-hosted Docling. It's very slow without a GPU.
Best and fast: page images to an LLM (GPT-Nano or Gemini 2.5 lite).
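For the image-to-LLM route, here's a minimal sketch of what sending one rendered page to Gemini's REST endpoint can look like. The model ID, endpoint, and payload shape are assumptions based on the public generateContent format, so check the current docs before relying on them:

```python
import base64

# Assumed endpoint/model id; verify against the current Gemini API docs.
GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.5-flash-lite:generateContent"
)

def build_page_request(page_png: bytes, prompt: str) -> dict:
    """Build a generateContent payload for one rendered PDF page."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(page_png).decode("ascii"),
                }},
            ]
        }]
    }

# Usage: POST this as JSON to GEMINI_URL with your API key header.
payload = build_page_request(b"\x89PNG...", "Convert this page to markdown.")
```

The nice part of doing it page by page is that each request stays small, so one bad page doesn't sink the whole document.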
My current method is a mixture:
I have a UI to remove sections I do not want; those sections I send to the LLM.
Full pages go to Docling.
Then I put everything back together based on page numbers.
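That stitch-back step can be as simple as keying every fragment by page number, whichever tool produced it. A sketch (the function and field names are made up for illustration, and I'm assuming LLM output should win on pages both pipelines touched):

```python
def stitch_pages(llm_pages: dict[int, str], docling_pages: dict[int, str]) -> str:
    """Merge per-page markdown from both pipelines, ordered by page number.
    Pages present in both dicts take the LLM version."""
    merged = {**docling_pages, **llm_pages}  # llm_pages overrides docling_pages
    return "\n\n".join(merged[n] for n in sorted(merged))

doc = stitch_pages(
    llm_pages={2: "## Table page (LLM)"},
    docling_pages={1: "# Intro (Docling)", 2: "garbled", 3: "Outro (Docling)"},
)
```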
Experimenting with:
Sending the headers plus the TOC to the LLM to get back only a proper markdown header setup.
Sending only the pages with tables/graphics to the LLM.
Converting normal PDF pages to markdown.
Replacing headers via text search for proper markdown header organization.
And spinning up Docling on a local PC for as-needed access to the entire PDF.
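That per-page routing idea ("send only pages with tables/graphics to the LLM") can be sketched as a tiny classifier over page metadata. The thresholds and field names here are invented for illustration; in practice you'd get the counts from a layout pass:

```python
from dataclasses import dataclass

@dataclass
class PageInfo:
    number: int
    n_tables: int    # counted by some layout-analysis pass (assumed)
    n_images: int
    text_chars: int

def route(page: PageInfo) -> str:
    """Heuristic: complex layout goes to the LLM, plain text stays local."""
    if page.n_tables or page.n_images:
        return "llm"        # worth paying for the LLM
    if page.text_chars > 0:
        return "pdf-to-md"  # cheap local markdown conversion
    return "skip"           # blank page

pages = [PageInfo(1, 0, 0, 1200), PageInfo(2, 2, 1, 300), PageInfo(3, 0, 0, 0)]
plan = {p.number: route(p) for p in pages}
```

This keeps the expensive tool for exactly the pages that need it, which is the "don't overpay, don't lose data" trade-off in one place.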
Hey brother, I don't know if this is helpful, but I just want to share: instead of passing the image to Gemini, if you have the budget for it, you should look into LlamaParse. It's really good at pulling out text and at interpreting graphs and flowcharts and turning them into Mermaid markup. I haven't had any luck with Gemini.
Thank you. P a r s e E x t r a c t - if I understand correctly, table extraction costs $0.01, but it still looks like the cheapest API solution. My goal for now is smart data extraction from PDFs: automatically classify every page of a mixed-content document and extract each page with the most relevant tool (don't overpay, don't lose data).