Hello everyone
This is my first post, and it’s a question about the best way to insert PDFs with tables into Qdrant (it could be Supabase or Pinecone as well) so that the tables keep their semantics.
I keep having the same issue: the embeddings/chunking do not recognize the tables, so the result is not usable for queries.
I tried different approaches: downloading the PDF file and inserting it directly into Qdrant, extracting the content from the file before inserting, etc.
Can anyone point me in the right direction on what the best approach is?
None of the YouTube videos about RAG workflows in n8n cover this in any depth.
Hi @Paulo_Rodrigues, I’m not an expert on this topic, but have you tried pre-processing the PDF files to extract the table data into something more readable like CSV or Excel, or using OCR or other third-party tools to read the tabular data in the PDF, and then generating embeddings from the processed data?
Also, if you’ve managed to figure this out or have already tried other methods, it would be great if you could share them here.
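To make the pre-processing idea a bit more concrete, here is a minimal Python sketch of one possible approach: once a table has been extracted to CSV, each row is turned into a small text chunk that repeats the column headers, so the chunk still makes sense on its own when it is embedded. The function name `table_to_row_chunks`, the table title, and the sample data are all made up for illustration. The extraction step and the insert into Qdrant are not shown, and I have not verified that this gives better retrieval on real documents.

```python
import csv
import io


def table_to_row_chunks(csv_text, table_title):
    """Turn each CSV row into one self-describing text chunk.

    Every chunk repeats the table title and the column headers, so the
    row keeps its meaning even when it is embedded in isolation.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        # Keep only non-empty string cells, written as "header: value".
        pairs = [
            f"{header.strip()}: {value.strip()}"
            for header, value in row.items()
            if isinstance(value, str) and value.strip()
        ]
        chunks.append(f"{table_title} | " + "; ".join(pairs))
    return chunks


# Hypothetical sample data standing in for a table extracted from a PDF.
sample_csv = """Product,Region,Price (EUR)
Widget A,Lisbon,12.50
Widget B,Porto,9.90
"""

chunks = table_to_row_chunks(sample_csv, "Price list 2024")
for chunk in chunks:
    print(chunk)
```

Each printed line would then be embedded as a separate point, instead of letting a generic text splitter cut the table in the middle of a row.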
In brief: you should question whether your structured data is appropriate for semantic search at all; a vector DB is not always the right tool for the job. But the article also gives an overview of different ways to use structured data in a semantic search context and illustrates its insights with a detailed experiment.
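As a rough illustration of the "not always the right tool" point (this is my own sketch, not something taken from the article): if the questions asked of a table are exact lookups or comparisons, loading the rows into a relational table and querying them with SQL may be enough. The example below uses Python's built-in `sqlite3` module; the `prices` table, its columns, the helper `cheapest_in_region`, and the data are hypothetical.

```python
import sqlite3

# In-memory database holding a hypothetical table extracted from a PDF.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (product TEXT, region TEXT, price_eur REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("Widget A", "Lisbon", 12.5),
        ("Widget B", "Porto", 9.9),
        ("Widget C", "Porto", 14.0),
    ],
)


def cheapest_in_region(connection, region):
    """Return (product, price_eur) for the cheapest product in a region, or None."""
    return connection.execute(
        "SELECT product, price_eur FROM prices "
        "WHERE region = ? ORDER BY price_eur ASC LIMIT 1",
        (region,),
    ).fetchone()


result = cheapest_in_region(conn, "Porto")
print(result)
```

A question like "what is the cheapest product in Porto?" is answered exactly here, whereas a similarity search over embedded rows only returns the rows that look most similar to the question.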
Hello aya. Thank you for your help. I used LlamaParse to solve this issue, but I am still testing. There is no easy solution. My problem is even bigger because I am working with Portuguese texts.