I have a number of documents which need classification. The usual techniques for categorization/classification are not reliable enough regardless of model or prompting (I would be perfectly OK with 85-90% accurate classification if only I could know which 10-15% is wrong).
My idea is to build a RAG store which ingests already-categorized documents and writes the category into a metadata field.
Then I would send my uncategorized document to the AI and look for the most similar document stored in the RAG store, where I would not be interested in the document's content but in its category.
Is there something obviously wrong with this approach, and is anyone using something like this in their own work? Any ideas or pointers?
The advantage of this approach would be easy addition of new, accurate examples, and fairly easy identification of documents which cannot be accurately classified.
P.S. In my case I will also use local LLMs, since the documents in question contain sensitive data (I am also searching for the optimal model for this task; currently experimenting with gpt-oss, although I suspect there are better/faster choices).
You're on the right track: what you're describing is a retrieval-based classification system (essentially a nearest-neighbor classifier over semantic embeddings).
It’s a legitimate, explainable, and increasingly popular alternative to black-box LLM classification.
A few enhancements and practical tips worth considering:
Enhancements and Techniques
Top-k voting:
Instead of relying only on the single closest match, retrieve the k most similar documents and assign the category via majority or weighted vote (by cosine similarity).
This usually smooths out noisy embeddings.
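The top-k voting step above can be sketched as follows. This is a hypothetical minimal implementation assuming unit-normalized embeddings (so the dot product equals cosine similarity); the function and variable names are illustrative, not from any particular library.

```python
import numpy as np
from collections import defaultdict

def topk_vote(query_vec, doc_vecs, doc_labels, k=5):
    """Assign a category by similarity-weighted vote over the k nearest docs."""
    sims = doc_vecs @ query_vec           # cosine similarity (unit vectors)
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar docs
    scores = defaultdict(float)
    for i in top:
        scores[doc_labels[i]] += sims[i]  # weight each vote by its similarity
    return max(scores, key=scores.get)

# Toy example with 3-dimensional "embeddings" and made-up categories
docs = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0]], dtype=float)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
labels = ["invoice", "invoice", "contract"]
q = np.array([1.0, 0.05, 0.0]); q /= np.linalg.norm(q)
print(topk_vote(q, docs, labels, k=3))  # → invoice
```

Weighting by similarity (rather than a plain majority vote) means one very close match can outvote two distant ones, which tends to help when categories are imbalanced in the store.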
Per-category centroids:
Precompute an average embedding for each category and classify new documents by whichever centroid they’re closest to.
It’s simpler and often more stable over time.
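A centroid classifier along the lines described above might look like this. Again a hedged sketch with invented names, assuming unit-normalized embeddings; the centroid is the mean of a category's document vectors, re-normalized for cosine comparison.

```python
import numpy as np

def build_centroids(doc_vecs, doc_labels):
    """Average the embeddings of each category and re-normalize."""
    centroids = {}
    for label in set(doc_labels):
        members = doc_vecs[[i for i, l in enumerate(doc_labels) if l == label]]
        c = members.mean(axis=0)
        centroids[label] = c / np.linalg.norm(c)
    return centroids

def classify(query_vec, centroids):
    # Nearest centroid by cosine similarity (all vectors are unit length)
    return max(centroids, key=lambda label: centroids[label] @ query_vec)

# Toy example with made-up 2-dimensional embeddings
docs = np.array([[1, 0], [0.8, 0.2], [0, 1]], dtype=float)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
labels = ["invoice", "invoice", "contract"]
cents = build_centroids(docs, labels)
q = np.array([0.9, 0.1]); q /= np.linalg.norm(q)
print(classify(q, cents))  # → invoice
```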
Hybrid RAG + LLM:
After retrieving similar documents, prompt your local LLM with something like:
“Given these examples and their categories, which category best fits this new document?”
This lets the LLM reason with explicit evidence while keeping the process explainable.
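Building that few-shot prompt from retrieved neighbors could be sketched as below. This is an assumption about the prompt shape, not a prescribed template; `retrieved` stands in for whatever your vector store returns, and the resulting string would be sent to your local LLM.

```python
def build_prompt(retrieved, new_doc):
    """retrieved: list of (text, category) pairs, nearest first."""
    examples = "\n\n".join(
        f"Document:\n{text}\nCategory: {cat}" for text, cat in retrieved
    )
    return (
        "Given these examples and their categories, which category best "
        "fits the new document? Answer with the category name only.\n\n"
        f"{examples}\n\nNew document:\n{new_doc}\nCategory:"
    )

# Toy usage with one made-up retrieved neighbor
prompt = build_prompt([("Total due: 120 EUR", "invoice")], "Amount owed: 99 EUR")
```

Keeping the retrieved examples in the prompt is what makes the decision auditable: you can always show exactly which labeled documents the model was reasoning from.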
Metadata filtering:
If you know extra attributes (language, region, time period, etc.), filter retrievals by them before similarity search.
It can significantly improve precision.
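As a sketch of the metadata-filtering idea: here the store is just a list of dicts, purely for illustration; real vector databases (Qdrant, Chroma, etc.) expose equivalent metadata filters that run before or alongside the similarity search.

```python
def filter_candidates(store, **required):
    """Keep only documents whose metadata matches every required field."""
    return [
        doc for doc in store
        if all(doc["meta"].get(k) == v for k, v in required.items())
    ]

# Toy store with made-up metadata
store = [
    {"id": 1, "meta": {"lang": "de", "year": 2023}},
    {"id": 2, "meta": {"lang": "en", "year": 2023}},
    {"id": 3, "meta": {"lang": "de", "year": 2021}},
]
hits = filter_candidates(store, lang="de", year=2023)  # only id 1 survives
```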
Progressive refinement:
If you notice certain edge cases being misclassified, just add those labeled examples to the vector store.
The classifier improves incrementally without retraining.
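The "no retraining" point is worth making concrete: refinement is literally just appending a human-corrected example to the store. A minimal sketch, with an invented `ExampleStore` class standing in for your vector database:

```python
class ExampleStore:
    """Toy stand-in for a vector store holding (embedding, label) pairs."""
    def __init__(self):
        self.vectors, self.labels = [], []

    def add(self, vec, label):
        self.vectors.append(vec)
        self.labels.append(label)

store = ExampleStore()
store.add([0.1, 0.9], "contract")  # initial labeled examples
store.add([0.9, 0.1], "invoice")
# An edge case the classifier got wrong, now human-labeled and added back:
store.add([0.5, 0.5], "contract")
```

From the next query onward, the new example participates in retrieval; there is no separate training step to rerun.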
Limitations to Watch For
Embedding quality is critical — domain-specific text (legal, medical, technical) may need fine-tuned embeddings.
Overlapping or fuzzy categories can blur together in embedding space.
Long documents should be chunked before embedding; use mean or max pooling to form a single vector per document.
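Chunking plus mean pooling, as suggested above, can be sketched like this. The `embed` function here is a random-vector placeholder (an assumption) where a real embedding model such as bge-large-en would plug in; chunk sizes and overlap are arbitrary illustrative values.

```python
import numpy as np

def chunk(text, size=200, overlap=50):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts):
    # Placeholder: a real embedding model would go here.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

def doc_vector(text):
    vecs = embed(chunk(text))
    pooled = vecs.mean(axis=0)              # mean pooling over chunk vectors
    return pooled / np.linalg.norm(pooled)  # unit-normalize for cosine search

v = doc_vector("some long document " * 50)  # one unit vector per document
```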
Recommended Local Embedding Models
nomic-embed-text – fast and good all-rounder
instructor-xl or gte-large – higher accuracy
bge-large-en – open-source and excellent performance
For local LLM reasoning or explanations:
Mistral 7B, Llama 3 8B, or Mixtral are efficient and solid choices.
Overall, your design is interpretable and privacy-friendly: you get an evolving, explainable classifier that can flag uncertain cases automatically.
It’s a smart architecture that many teams are moving toward for sensitive or regulated data.
Thank you, I did not expect to receive compliments.
Some remarks on your suggestions:
Top-k voting is a logical extension of the idea; it can probably be combined with a reranking step and will likely further improve accuracy.
I am also thinking that the chunking policy for the RAG store will likely have a positive influence on overall accuracy (particularly in the top-k/reranking stage).
The issue I am worried about is that my documents are not in English. I think most models have accuracy problems because of that, whereas once we move into vector-similarity space the model should be largely oblivious to the language used.
I also think the RAG approach would improve overall speed (my volume is such that I am not really concerned about speed, but in general vector retrieval will always be faster than reasoning, and I can likely use smaller/faster models too).
My other issue is that documents in some (not all) of my categories are quite similar. While the differences are obvious to a human observer (most of the content is nearly identical, with just a couple of variables differentiating the categories), my experience is that LLMs are easily confused: because the overall context is almost the same, the keyword differences appear insufficient for a solid decision, and the result becomes uncertain regardless of my one-shot or few-shot prompting attempts.
This is really a problem. I was working on a project to assist nursing students in Germany. The German language was a big problem, especially when there were also Latin words (like nervus … or musculus …). Embedding in German worked, but in the end a lot of data optimization was needed beforehand to get it working (poorly, to be honest; I am still trying to figure out a better way). So yes, a lot of challenges, especially when the language is not English. Looking forward to seeing what your solution will look like.