PDF Scans > TEXT in Default Data Loader (image pdf) processing

Hi everyone,

I am building a robust agent for loading data from GDRIVE → Vector storage.

Right now I am working on having it handle PDFs. It loops through text PDFs fine, but it seems to get stuck on scanned PDFs that are image-based rather than text-based.

I would like it to handle both cases. Any suggestions on how to process image PDFs and load their content into the DB with the text splitter?

Hey, looking into the same… did you get it working?

No, we decided to wait on image-only PDF support until the next version.

But from what I found so far:

You could build a workflow that checks whether the PDF is text-based or image-only, and route it based on that check.

We already have the text-based workflow (standard) working.

For image-based PDFs (we would still need to preserve the formatting information, so technically we need both), my plan is to do something like this:

  1. Convert the PDF to images and upload them to cloud storage (AWS).
  2. Optional: OCR the PDF (there are tools for this, the recommendation was “Tesseract”), then load the text into the vector store as normal.
  3. Combine the image scan URLs and other metadata in a sidecar JSON file for each PDF.
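
As a rough illustration of step 3, here is a minimal Python sketch of what such a sidecar file could look like. The field names (`source_pdf`, `pages`, `image_url`, `ocr_used`) and the `.sidecar.json` suffix are made up for this example, not something from our actual setup, and the image URLs are assumed to come from whatever upload step you already have.

```python
import json
from pathlib import Path


def build_sidecar(pdf_name: str, page_image_urls: list[str], ocr_used: bool) -> dict:
    """Collect per-PDF metadata that vector store entries can point back to.

    All field names here are hypothetical; adjust them to your own schema.
    """
    return {
        "source_pdf": pdf_name,
        "page_count": len(page_image_urls),
        "ocr_used": ocr_used,
        # One entry per page image, numbered from 1.
        "pages": [
            {"page": number, "image_url": url}
            for number, url in enumerate(page_image_urls, start=1)
        ],
    }


def write_sidecar(sidecar: dict, out_dir: Path) -> Path:
    """Write the sidecar next to the other outputs as <pdf stem>.sidecar.json."""
    out_path = out_dir / (Path(sidecar["source_pdf"]).stem + ".sidecar.json")
    out_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return out_path
```

You would call `build_sidecar(...)` once per PDF after the page images are uploaded, then store the resulting file wherever the workflow can find it again by PDF name.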

This is specific to our case, so you may not need everything we have. Just know that finding out whether a PDF is image-only or text-based is fairly straightforward. I haven't tested it yet but will likely add it in a few days. You basically run a free PDF-to-text converter script, and if it returns 0 (no text), you know it's an image-only PDF.
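
For anyone who wants to try the text-vs-image check, here is a small untested-in-production sketch in Python. It assumes the `pypdf` package as the "free converter" (any text extractor would do), and the `min_chars` threshold is my own addition: some scans return a few stray characters rather than exactly 0, so a small cutoff may be safer than a strict zero check. The function names are just placeholders.

```python
def is_image_only(page_texts: list[str], min_chars: int = 20) -> bool:
    """Return True when text extraction produced (almost) nothing.

    min_chars is an assumed cutoff, not a standard value; tune it on your own files.
    """
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars < min_chars


def extract_page_texts(pdf_path: str) -> list[str]:
    """Extract the text of each page with pypdf (pip install pypdf)."""
    # Imported lazily so the check above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned pages, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


def route(pdf_path: str) -> str:
    """Decide which branch of the workflow a PDF should take."""
    return "ocr" if is_image_only(extract_page_texts(pdf_path)) else "text"
```

`route("some.pdf")` returns `"ocr"` for scans and `"text"` for normal PDFs, so the result can drive a simple if/switch step in the workflow.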

Thanks for the tips!
I am now looking into using Google Drive as a way to get a more or less universal converter, including OCR.
This is a very nice template I'm looking into:

Why don't you try Mistral’s new OCR, which can pull images out of PDFs for you alongside the text?

I had this problem and found this, which so far seems good, but I'm still in the process of testing it.

It's a little more complex to set up, so it depends on how robust you want your document handling to be.

Hey, thanks for the tip. Actually, I had looked at unstructured and decided it was too much complexity for the present work…
