Has anyone built a more advanced RAG system in n8n that also handles images effectively? I’m dealing with detailed manuals (like mechanical engineering documents) where text and images/drawings are intertwined. Traditional RAG setups focus purely on text, but I need a way to split and embed both text and images (possibly via OCR) and then retrieve the images themselves when needed, rather than only converting everything to text or vectors.
Any tips or workflows on how to:
OCR/process these manuals (often in complex PDF formats).
Store/retrieve images (so I can query a manual and get an actual image back).
Keep it all integrated within n8n?
Would really appreciate any insights… I added 3 images to show the kind of information retrieval I’m mainly talking about, but in general I mean the more complex elements of a PDF.
Thanks!
Information on your n8n setup
n8n version: 1.69.2
Database (default: SQLite): SQLite
n8n EXECUTIONS_PROCESS setting (default: own, main): own, main
Running n8n via (Docker, npm, n8n cloud, desktop app): self-hosted on Google Cloud
You might want to look into AI vision and the “grounding” approach, which has been quite popular in the legal space over the past year (see OrbitalWitness). The idea may be a few steps removed from what I assume you ultimately want, but it’s a good starting point (a rough sketch follows below the list):
split the document into pages and store each page separately as an image asset, i.e. on disk or in an object store
create multi-vector embeddings for each page asset [^1]
attach the document ref, page ref, page number etc. to each relevant embedding as metadata
when retrieving matching results from the vector store, use the metadata to fetch the previously stored page asset
display the page asset to the user
optionally, post-process as needed to extract the specific image, which may be tricky
[^1]: It might also make sense to create and manage a separate collection/index for images, i.e. if you need to rebuild your text-only vector collection, you’d want to avoid re-vectorising all the images again.
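A minimal sketch of that pipeline in Python, assuming PyMuPDF for page rendering, pytesseract for OCR and chromadb as the vector store; the file names, query and DPI are illustrative, not a fixed recipe:

```python
import os

import chromadb
import fitz          # PyMuPDF: pip install pymupdf
import pytesseract   # pip install pytesseract (needs the tesseract binary)
from PIL import Image

ASSET_DIR = "page_assets"  # on-disk store for the page image assets
os.makedirs(ASSET_DIR, exist_ok=True)

client = chromadb.Client()
pages = client.create_collection("manual_pages")

doc = fitz.open("manual.pdf")
for page_num, page in enumerate(doc, start=1):
    # 1. split into pages and store each page separately as an image asset
    asset_path = os.path.join(ASSET_DIR, f"manual-p{page_num}.png")
    page.get_pixmap(dpi=150).save(asset_path)

    # 2. embed the page content (OCR here; a vision model also works)
    text = pytesseract.image_to_string(Image.open(asset_path))

    # 3. attach document ref, page number and asset path as metadata
    pages.add(
        ids=[f"manual-p{page_num}"],
        documents=[text],
        metadatas=[{"doc": "manual.pdf", "page": page_num,
                    "asset": asset_path}],
    )

# 4./5. retrieve a match, then use the metadata to fetch the stored asset
hit = pages.query(query_texts=["hydraulic pump exploded view"], n_results=1)
asset_path = hit["metadatas"][0][0]["asset"]  # display this image to the user
```

From n8n you could run something like this via an Execute Command node or a small HTTP service; the key point is that the vector store only holds text plus metadata, while the page images live on disk.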
If n8n could implement LangChain’s MultiVectorRetriever, which combines a vectorstore with a docstore, it would be easier to build a pipeline that retrieves the source document. Is that possible?
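I’m not aware of a built-in n8n node for it, but the LangChain pattern itself is simple to run in a small external service that n8n calls. A sketch using the Python API (exact import paths shift between LangChain releases, and the ids/payloads here are illustrative):

```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

id_key = "doc_id"
vectorstore = Chroma(collection_name="page_summaries",
                     embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # swap for a persistent store in production
retriever = MultiVectorRetriever(vectorstore=vectorstore,
                                 docstore=docstore, id_key=id_key)

# a small, searchable summary goes into the vector store...
summary = Document(page_content="exploded view of the hydraulic pump assembly",
                   metadata={id_key: "manual-p12"})
retriever.vectorstore.add_documents([summary])

# ...while the source payload (e.g. the page image path) sits in the docstore
retriever.docstore.mset([
    ("manual-p12", Document(page_content="page_assets/manual-p12.png")),
])

# retrieval matches on the summary but returns the docstore payload
docs = retriever.invoke("hydraulic pump exploded view")
```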
Thanks for the great answer! I’m mainly interested in a solution where, based on the document type, I can easily determine which OCR algorithm or technique to use. Otherwise, if the document type changes slightly and a different solution might be more suitable, I’d need to completely adapt the workflow. I’m looking for something that can be set up to work across multiple use cases, if that makes sense.
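One way to get that flexibility is to make the OCR choice a routing decision rather than part of the pipeline itself, e.g. an n8n Switch node (or a Code node) that classifies the document and hands it to a matching sub-workflow. A rough Python sketch of the idea; the detection heuristic and handler names are assumptions, not a real API:

```python
import fitz  # PyMuPDF

def detect_doc_type(path: str) -> str:
    # crude heuristic (assumption): digital PDFs carry an extractable
    # text layer, scanned PDFs do not
    has_text = any(page.get_text().strip() for page in fitz.open(path))
    return "digital" if has_text else "scanned"

def extract_digital(path: str) -> str:
    # direct text-layer extraction, no OCR needed
    return "\n".join(page.get_text() for page in fitz.open(path))

def extract_scanned(path: str) -> str:
    # placeholder: render pages and run your OCR engine of choice here
    raise NotImplementedError("plug in an OCR engine for scanned documents")

HANDLERS = {"digital": extract_digital, "scanned": extract_scanned}

def process(path: str) -> str:
    return HANDLERS[detect_doc_type(path)](path)
```

Supporting a new document type then means registering one more handler (or Switch branch) instead of rewriting the whole workflow.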