Hi,
I'm trying to build my first RAG agent with n8n, using Unstructured to extract data from PDFs and Ollama/Qdrant to embed the data.
There are plenty of “cheap” tutorials on how to build a RAG agent that will make you a billionaire overnight, but not many on how to cleanse your data or on the format your embedding model expects. So I searched the internet and had a “deep” discussion with several LLMs to find clues. This is my workflow:
After running some tests, I rarely get good results with it.
During the retrieval phase, the Qdrant vector store always returns a lot of noise (I don't understand where it comes from). So I increased the number of results fetched from Qdrant to raise the probability of getting the good elements (I know, I still don't have a reranker), but that's just a cheap trick that won't hold up as the number of my documents grows.
My problem probably comes from the embedding phase. I built a simple PDF file with a single title and some text from the internet. Right before the embedding, I have clean content (one column with “pageContent” and another with “metadata”). But during retrieval, I end up with this junk, containing just one word from my whole page of text:
type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"62770195-7907-4028-8096-9d31363bf853"}
1 type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"112d58ba-7c57-414a-9312-b93512c38d66"}
2 type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"89681d9c-02bc-458a-8733-15c5197ef69b"}
3 type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"87fbc95c-c64f-409a-9604-8512b26a84a3"}
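To make the pattern in these hits easier to see, here is a small Python sketch (standard library only, run outside n8n) that groups the four hits copied from the output above by their text. The function name summarize_hits is just something I made up for this check. It shows that the same short heading is stored under four different ids, and the metadata says "application/json" with lines 1 to 1. That makes me suspect the content is being treated as JSON and split line by line, or that I inserted the same file several times, but that is only my guess.

```python
import json
from collections import defaultdict

# The four hits copied from the retrieval output above (straight quotes).
raw_hits = [
    '{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"62770195-7907-4028-8096-9d31363bf853"}',
    '{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"112d58ba-7c57-414a-9312-b93512c38d66"}',
    '{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"89681d9c-02bc-458a-8733-15c5197ef69b"}',
    '{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"87fbc95c-c64f-409a-9604-8512b26a84a3"}',
]


def summarize_hits(hits):
    """Group hits by pageContent and collect the point ids stored for each text.

    Hypothetical helper for this post, not part of n8n or Qdrant.
    """
    groups = defaultdict(list)
    for raw in hits:
        hit = json.loads(raw)
        groups[hit["pageContent"]].append(hit["id"])
    return dict(groups)


summary = summarize_hits(raw_hits)
for text, ids in summary.items():
    # One line per distinct text: how many points share it and how long it is.
    print(f"{len(ids)} point(s), {len(text)} characters of text: {text!r}")
```

If the chunks were what I expected, each hit should carry a paragraph-sized pageContent and the texts should differ from one hit to the next.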
Do you have any idea where my problem is? Is my data format for embedding wrong?
Thank you for your help