Cleaning data format for RAG embedding and retrieval

Hi,

I'm trying to build my first RAG agent with n8n, Unstructured for data extraction from PDFs, and Ollama/Qdrant for embedding the data.

You can find a lot of "cheap" tutorials about building a RAG agent that will make you a billionaire overnight, but not many about how to cleanse your data or what format your embedding model expects. So I searched the internet and had a "deep" discussion with different LLM models to find clues. This is my workflow:

After running some tests, I rarely get good results with it.

During the retrieval phase, the Qdrant vector store always gives me a lot of noise (I don't understand where it comes from). So I increased the number of results to fetch from Qdrant to raise the probability of getting good elements (I know I still don't have a reranker), but that's just a cheap trick that stops working as my document count grows.

My problem probably comes from the embedding phase. I built a simple PDF file with a single title and some text from the internet. Right before embedding, I have clean content (one column with "pageContent" and another with "metadata"). But during retrieval, I end up with this mess, with one word from my whole text page:

    type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"62770195-7907-4028-8096-9d31363bf853"}
    type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"112d58ba-7c57-414a-9312-b93512c38d66"}
    type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"89681d9c-02bc-458a-8733-15c5197ef69b"}
    type:text text:{"pageContent":"3.2. HyperMesh","metadata":{"source":"blob","blobType":"application/json","line":10,"loc":{"lines":{"from":1,"to":1}}},"id":"87fbc95c-c64f-409a-9604-8512b26a84a3"}

Do you have any idea where my problem is? Is my data format for embedding wrong?

Thank you for your help

The issue sounds like it's coming from the chunking or the Qdrant query config, not the embeddings themselves. We had something similar with Ollama embeddings: it turned out the recursive character splitter was breaking chunks up too aggressively. A couple of things to check: what are your chunk size and overlap set to? Also try querying Qdrant directly to see what it actually stored. If there are one-word chunks in there, the embedding model is fine; it's the data going in. If the stored chunks look good but retrieval still returns fragments, it could be a top-k issue in your vector store tool.

For PDFs, strip headers, footers, page numbers and repeated cross-page text before embedding. They pollute your vectors badly.
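To make "strip repeated cross-page text" concrete, here is a minimal stdlib-only sketch of the idea: bare page numbers are dropped by pattern, and any short line that recurs on most pages is treated as a header/footer. The function name and thresholds are illustrative, not from any specific library:

```python
import re
from collections import Counter

def strip_page_furniture(pages):
    """Drop page numbers and short lines repeated across most pages
    (typical headers/footers) before chunking and embedding.
    `pages` is a list of per-page text strings."""
    # Count how often each non-empty line appears across all pages.
    line_counts = Counter(
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.strip()
    )
    threshold = max(2, len(pages) // 2)  # "appears on most pages"
    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s:
                continue
            if re.fullmatch(r"(page\s*)?\d+(\s*/\s*\d+)?", s, re.I):
                continue  # bare page number like "3" or "Page 3/12"
            if line_counts[s] >= threshold and len(s) < 80:
                continue  # short repeated line: likely header/footer
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned
```

Run it on the per-page output of your extraction step, before the text ever reaches the splitter, so the repeated furniture never gets embedded.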

On chunk size: 256-512 tokens with 10-20% overlap works well for factual retrieval. Ollama context windows are smaller, so don't push past 512 without testing. The real issue is usually chunk quality, not size.
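The numbers above translate directly into a sliding window. A minimal word-based sketch (word count stands in for tokens here, which is an approximation; in a real pipeline you'd split on a tokenizer's counts):

```python
def chunk_words(text, chunk_size=384, overlap_ratio=0.15):
    """Greedy sliding-window chunker: chunk_size words per chunk,
    with overlap_ratio of each chunk repeated in the next one.
    Defaults mirror the 256-512 tokens / 10-20% overlap guideline."""
    words = text.split()
    # Advance by (1 - overlap) of a chunk each step.
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

If a chunk this size still comes out as a lone heading like "3.2. HyperMesh", the problem is upstream of the splitter: the extraction step emitted that heading as its own tiny document.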

If your PDFs have tables or columns, Unstructured needs hi_res strategy or it mangles the text order. That alone explains most “retrieves correctly but wrong content” bugs.

I chose a chunk size of 1000 and an overlap of 150. When I check in Qdrant, my points have a length of 768 (I also tried with 512 but got the same result; I increased it to see the results better in Qdrant). The strange thing in Qdrant is that some points contain this text:

“content" : “search_document: [SOURCE: Doc1.pdf] [SECTION: 1. contrainte de Von Moses] [PAGE: 1] [REF: 7cec68e4] | TEXTE: Pour certains matériaux, les déformations/contraintes plastiques équivalentes sont écrites sous forme de variables historiques supplémentaires.”“

and others contain only this (they all have metadata too):

    "content": "Doc1.pdf"

but both of them are reported as length 768.

To avoid document-cleaning problems, before this post, as part of my evaluation process, I created a 2-page PDF with 2-3 titles and some random text on a subject found on the internet, to see what I obtained. So I should have clean data in Qdrant, but that's not the case.

I already use the "hi_res" strategy, and I still do some post-processing on the data after unstructured.io; for most of my PDFs, the chunks after embedding are good enough.

My problem probably comes from the embedding. When I check the points in Qdrant, this is an example of what I get. Some contents are fine, some not really, as I explained above:

    {"content":"search_document: [SOURCE: Doc1.pdf] [SECTION: 1. c…","metadata":{"source":"blob","blobType":"application/json","line":1,"loc":{"lines":{"from":1,"to":1}}}}

And the second strange thing is the metadata fields: they are quite different from the metadata column I provide to the embedding (Qdrant node) for each chunk. These are the metadata fields I have before the embedding node for any chunk:

filename:Doc1 - Copie.pdf
page_number:1
category:NarrativeText
parent_id:ed3ea7edf67e3613684eb088eb8571a1
parent_title:1. Test
chunk_index:0
total_parts:1
original_element_id:c8f5de63b26d5badf8e83f8446d672f7
content_length:381
global_index:1

What I see is that during embedding, my "metadata" column for each chunk gets broken down too, not only my "pageContent" column containing the content of my chunks, leading to a lot of the garbage points I see in Qdrant.

I am trying to solve this problem.

Update: if in the "Default Document Loader" node I only load the "pageContent" field instead of my whole data:

  {
    "pageContent": "search_document: [SOURCE: Doc1 - Copie.pdf] [SECTION: 1. Test] [PAGE: 1] [REF: c8f5de63] | TEXTE: This is a loooog loooong test to see if my text is correctly identified.",
    "metadata": {
      "filename": "Doc1 - Copie.pdf",
      "page_number": 1,
      "category": "NarrativeText",
      "parent_id": "ed3ea7edf67e3613684eb088eb8571a1",
      "parent_title": "1. Test",
      "chunk_index": 0,
      "total_parts": 1,
      "original_element_id": "c8f5de63b26d5badf8e83f8446d672f7",
      "content_length": 381,
      "global_index": 1
    }
  },

I obtain a clean database in Qdrant, but then I am still missing all my metadata in Qdrant. I still haven't found how to integrate it; I'm working on this.

I found the answer here : How to add metadata to all chunks when embedding in Pinecone Vector Database - #2 by jennapederson

Metadata fields need to be specified in the "Default Document Loader" node: in its options you can add "Metadata fields". This way everything works fine now.
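For anyone rebuilding this outside n8n, the underlying principle is the same: only the "pageContent" string goes through the embedding model, while the metadata dict rides along unembedded as the point's payload. A minimal sketch under that assumption (`embed_fn` and `build_point` are illustrative names, not n8n's or Qdrant's API):

```python
import uuid

def build_point(chunk, embed_fn):
    """Embed only the chunk text; attach metadata as a separate payload.
    `chunk` mirrors the {"pageContent": ..., "metadata": {...}} shape
    from the document loader; `embed_fn` is any text -> vector function."""
    return {
        "id": str(uuid.uuid4()),
        # Only the text is embedded; the metadata never reaches the model.
        "vector": embed_fn(chunk["pageContent"]),
        "payload": {
            "content": chunk["pageContent"],
            **chunk["metadata"],  # filename, page_number, etc. for filtering
        },
    }
```

Feeding the whole record (text plus metadata) to the loader instead is exactly what produced the fragment points above: each metadata value got treated as its own document and embedded separately.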

Thanks for your suggestions.