RAG approach to PDFs with tables, help needed

Hello everyone
This is my first post and it's a question about the best way to insert PDFs with tables into Qdrant (could be Supabase or Pinecone as well) so that the semantics are kept.
I keep running into the same issue: the embedding/chunking step doesn't recognize the tables, so they're not usable for queries.
I tried different approaches: downloading the PDF file and inserting it directly into Qdrant, extracting the text from the file before inserting, etc.
Can anyone point me in the right direction on what's the best approach to this?
All the YouTube videos about RAG workflows in n8n don't cover this at a deeper level.

Information on your n8n setup

  • n8n version: 1.68.0
  • **Database (default: SQLite):** Postgres
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • **Running n8n via (Docker, npm, n8n cloud, desktop app):** Docker
  • **Operating system:** Linux

Hi @Paulo_Rodrigues, I'm not an expert on this topic, but have you tried pre-processing the PDF files to extract the table data into something more readable like CSV or Excel, or using OCR or other third-party tools to read the tabular data in the PDF and then generating embeddings from the processed data?
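To make that pre-processing idea concrete: once you've pulled a table out of the PDF with any extractor (e.g. pdfplumber's `extract_tables()`, Camelot, or an OCR tool; the extraction step itself is not shown here), you can render it as a markdown table before chunking and embedding, so the header/cell relationships survive as plain text. A minimal sketch, assuming the table arrives as a list of rows with the first row as the header (the function name is my own, not anything from n8n):

```python
def table_to_markdown(rows):
    """Render an extracted table (list of rows, first row = header)
    as a markdown table.

    Assumes the table was already extracted from the PDF as a list of
    lists of strings; empty cells may come through as None (as they do
    with pdfplumber, for example).
    """
    clean = [["" if cell is None else str(cell).strip() for cell in row]
             for row in rows]
    header, *body = clean
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

You would run this per table and store the markdown string as the chunk text, keeping the surrounding prose as separate chunks.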

Also, if you’ve managed to figure this out or have already tried other methods, it would be great if you could share them here :slight_smile:


Hi @Paulo_Rodrigues ,

this topic is keeping me busy, too.

Pinecone has an interesting article on this, although you might be disappointed if you are looking for an easy answer. :sweat_smile:

In brief: you should question whether your structured data is appropriate for semantic search at all; a vector DB is not always the right tool for the job. But the article also gives an overview of different ways to use structured data in a semantic search context and illustrates its insights with a detailed experiment.

You might also want to have a look at @Jim_Le's workflow, where he parses bank statement PDFs with Gemini Vision AI to produce markdown documents.

I haven't tried this yet, but it might be an interesting continuation to consider, i.e. how vector databases handle tables formatted as markdown…
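One caveat with markdown tables: a generic text splitter can still cut a table in half, leaving chunks whose rows have lost their header. An alternative I've seen discussed (purely an illustrative sketch here, the helper name is my own) is row-level chunking: turn each data row into its own self-contained chunk that repeats the header labels, so every embedded chunk keeps the column semantics no matter how it gets split:

```python
def rows_to_chunks(rows, table_title=""):
    """Turn a table (list of rows, first row = header) into one
    self-contained text chunk per data row, pairing each header label
    with its cell value so each chunk embeds meaningfully on its own.
    """
    header, *body = rows
    chunks = []
    for row in body:
        pairs = ", ".join(f"{h}: {v}" for h, v in zip(header, row))
        prefix = f"{table_title}: " if table_title else ""
        chunks.append(prefix + pairs)
    return chunks
```

The trade-off is more, smaller vectors and some redundancy from the repeated headers, but queries like "amount in February" then have a chunk that actually contains both the column name and the value.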

Cheers, Ingo


Hello aya. Thank you for your help. I used LlamaParse to address this issue, but I'm still testing. There is no easy solution. My problem is even bigger because I am working with Portuguese texts.

Hello Ingo. Thank you for your help. I am trying LlamaParse for this and am getting some good results, but I'm still looking for solutions… Cheers

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.