Chunking for Vector DB - Best Practices

Guys, I’m going through n8n step by step and I’ve come to the RAG part. As I understand it, the most important part of creating a knowledge base is chunking, and how you chunk your data into pieces is what will define the relevance of your AI’s answers.

Are there any resources on how to do this properly?

Additionally, if all my knowledge is in PDF files, I have to extract the text first and store the text chunks in the vector DB, right?

Hi,

I used these references:


While chunk size is critical, proper evaluation (as with anything that’s an ML/statistical model) is just as important. It’s very easy to see great results on small datasets, but accuracy may degrade when you scale up. A scheme like reranking may then be necessary; see, for example, this Document-based AI Chatbot with RAG, OpenAI and Cohere Reranker | n8n workflow template.

In any ML or statistical modeling class, you learn the same common phrase: “it depends on the data”. Chunk sizing is no different, as it also depends on your data. You can try a few common values to start (the docs suggest 200 to 500 tokens). As always, proper evaluation is needed.
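To make “try a few common values” concrete, here’s a minimal sketch of fixed-size chunking with overlap. Word counts stand in for tokens here (a real pipeline would use the embedding model’s tokenizer), and the function name and defaults are my own:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of ~chunk_size words, overlapping by `overlap` words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk; tuning `chunk_size` against your own evaluation set is the experiment described above.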

Experiment with tuning chunk size to start, then see how it may not improve accuracy when you scale up, then try more advanced schemes (like reranking).
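As an illustration of where reranking sits in the pipeline, here’s a toy second-pass scorer over already-retrieved chunks. This is not the Cohere API; real setups use a cross-encoder or a rerank service, and this lexical-overlap scorer is just a stand-in I made up:

```python
def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Re-order retrieved chunks by word overlap with the query, keep the best top_n."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        # Toy relevance score: number of shared words with the query.
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_n]
```

The point is the shape of the scheme: a cheap vector search over-fetches candidates, then a more careful (and more expensive) scorer re-orders them before they reach the LLM.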

Yes, text must be extracted from the PDF. PDFs are interesting because they don’t just contain text; they can also have tables, graphs, math formulas, and images. In my ingestion pipeline for RAG, I send PDFs through Unstructured when they’re rich with text, images, tables, and math. That may be overkill if you’re just experimenting and only interested in the text.
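For the text-only prototyping case, the extraction step can be as small as this. A sketch assuming the third-party `pypdf` package (`pip install pypdf`); the function names are mine:

```python
def normalize_page_text(raw: str) -> str:
    """Collapse runs of whitespace so page breaks don't split sentences oddly."""
    return " ".join(raw.split())

def extract_pdf_text(path: str) -> str:
    """Concatenate the plain text of every page in the PDF at `path`."""
    from pypdf import PdfReader  # imported here; the helper above needs no dependency
    reader = PdfReader(path)
    return " ".join(
        normalize_page_text(page.extract_text() or "") for page in reader.pages
    )
```

The resulting string is what you’d then feed to your chunker and embed into the vector DB. For layout-heavy PDFs this naive approach mangles tables and drops images, which is where something like Unstructured earns its keep.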

Khem


Thanks for sharing your knowledge! Reranking sounds interesting; it’s the first time I’m hearing about it. Yeah, this whole RAG situation is pretty complex after all.