OpenAI RAG - how to use a set of files

Describe the problem/error/question

I’ve looked far and wide and haven’t found enough information to understand how to do this.
I have a bunch of files and just want to use OpenAI to query this knowledge base. No outside vector DB. Just files -> OpenAI -> chat.
I’ve seen people load docs into a third-party vector DB and then use that with the OpenAI node. But I’d like to use only OpenAI (either the Assistants API or the new Responses API, doesn’t matter to me).

Any tips?

I think maybe we can just use tool sub-nodes to do this, but I couldn’t figure it out from the docs and examples.

Information on your n8n setup

  • n8n version: 1.80.3
  • Database (default: SQLite): default
  • n8n EXECUTIONS_PROCESS setting (default: own, main): default
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system: Ubuntu 24.04

If you prefer an approach like that, it will depend on the model’s maximum input and output tokens: simply parse your docs with the Extract from File node and put the extracted text into the AI system prompt, and you will quickly see where the token limit breaks.
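To make the "stuff everything into the system prompt" limitation concrete, here is a minimal sketch of a token-budget check you could run in a Code node before building the prompt. The 4-characters-per-token ratio is a crude heuristic for English text (not an exact tokenizer; swap in tiktoken for real counts), and the helper names and limits are my own illustration:

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def fits_in_context(doc_text: str, max_input_tokens: int = 128_000,
                    reserved_for_output: int = 4_000) -> bool:
    """Check whether extracted document text fits the model's context window."""
    return estimate_tokens(doc_text) <= max_input_tokens - reserved_for_output


# A 1 MB text file is roughly 250k tokens and will not fit in a 128k window.
big_doc = "x" * 1_000_000
print(fits_in_context(big_doc))  # False
```

This is why prompt-stuffing only works for small document sets; anything larger needs retrieval.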

With the second method, using sub-nodes, you must map each tool sub-node to avoid exceeding the maximum output tokens and define it in the AI system prompt. This method is essentially the same as uploading all files into a RAG store and mapping them via metadata.


Hey, thanks for helping! The reason I’m avoiding building my own vector DB is to avoid handling different types of files, like xlsx with multiple sheets or scanned PDFs. I’d like to rely on OpenAI for that (or another provider, really).

I can see in the OpenAI SDK docs that we can just upload files of different types and it will create the vector store.

I was trying to do this through n8n, or any similar approach where we don’t handle the files ourselves.
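For reference, the SDK flow mentioned here looks roughly like the sketch below (using the official openai Python SDK; in older SDK versions the vector store methods live under `client.beta.vector_stores` instead of `client.vector_stores`). The folder-walking helper and the extension list are my own illustration. One caveat worth flagging for the heterogeneous-files goal: OpenAI’s file_search accepts pdf/docx/pptx/txt/md and code files, but, at least at the time of writing, not spreadsheets, so xlsx/csv may still need a separate path:

```python
from pathlib import Path

# Illustrative subset of extensions the file_search tool accepts;
# notably xlsx/csv are NOT on OpenAI's supported list.
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}


def is_supported(filename: str) -> bool:
    """Filter to extensions file_search is known to accept."""
    return Path(filename).suffix.lower() in SUPPORTED


def build_vector_store(folder: str) -> str:
    """Upload every supported file in `folder` and attach it to one vector store."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    store = client.vector_stores.create(name="my-knowledge-base")
    for path in Path(folder).iterdir():
        if not is_supported(path.name):
            continue  # route unsupported formats (xlsx, csv, ...) elsewhere
        f = client.files.create(file=open(path, "rb"), purpose="assistants")
        client.vector_stores.files.create(vector_store_id=store.id, file_id=f.id)
    return store.id
```

In n8n the equivalent would be HTTP Request nodes hitting the same endpoints, since there is no dedicated vector store node for this.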

Ahh, I see. In the same kind of situation, I also didn’t get it working in n8n because I still had to create individual logic to parse each file extension. So I usually use a feature from Flowise where I only need to create one flow to parse all file types into my pgvector, and the input automation still comes from n8n.

It would be interesting if someone already has a solution for this case in n8n.

It seems it should be possible, but I can’t figure it out by myself…

I’ve built a solution which uses Telegram to receive an invoice PDF and then extracts information from the document, like the total amount and banking details, which I can then store in a DB, etc. Is this what you’re looking for? This does NOT use RAG, but you can build something similar if you need RAG by using the “Simple Vector Store” sub-node, which stores the embedded vectors in server memory.

Simply wrap this in a loop for each file you need to process and use the results as needed.

I used the “Basic LLM Chain” node here. For a more RAG-style approach you’d use the “Question and Answer Chain” node, something like this:

And more recently this node was added which does the same thing:

Hey, thanks for sharing!!
My issue now is using a set of heterogeneous files, so I can have PDF, scanned PDF, CSV, xlsx, etc.
This means we sometimes need to handle files that need OCR first; some are text, some are tables, and so on…
My hope was to have OpenAI take care of that: upload all the files as they are and let OpenAI process them…

Yeah, unfortunately there is no silver bullet, but it’s not too difficult. When reading a file, look at the mime types you want to support, use an Extract from File node per type, and route files between them with a Switch node. n8n already makes it easy to extract text from different file types. Here is a quick-and-dirty example of extracting text from an Excel file and a PDF. I’m assuming you’ll be pulling different information out of each file type, so you may want a different LLM node per type.
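The Switch-node routing described above amounts to a simple mime-type lookup table; here is a sketch of the same idea in Python (the route names are placeholders for whichever Extract nodes or branches you wire up, not actual n8n identifiers):

```python
import mimetypes

# Map each mime type to the extraction branch that should handle it.
# The branch names here are illustrative placeholders.
ROUTES = {
    "application/pdf": "Extract from PDF",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Extract from XLSX",
    "text/csv": "Extract from CSV",
    "text/plain": "Extract from Text File",
}


def route(filename: str) -> str:
    """Pick an extraction branch from the file's guessed mime type."""
    mime, _ = mimetypes.guess_type(filename)
    return ROUTES.get(mime or "", "Unsupported: send to fallback branch")


print(route("invoice.pdf"))  # Extract from PDF
print(route("notes.txt"))    # Extract from Text File
```

The fallback branch is where formats like scanned PDFs (which need OCR rather than text extraction) would go.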

That’s really interesting. Thanks again.

I was going down that route until I stumbled on formats that are not available in n8n, like docx or scanned PDFs… That’s why I want to rely on OpenAI for parsing the files and creating the embeddings. We can do it in Python by creating an assistant, a vector store, etc., but I can’t find a way to do it in n8n.

My recommendation for scanned PDFs would be to use Mistral’s new OCR capability:
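For the scanned-PDF branch specifically, the Mistral OCR endpoint takes a document URL (or a base64 data URL) and returns markdown per page. A minimal sketch using the mistralai Python SDK follows; I’m going from the SDK docs here, so treat the call shape as an assumption to verify, and the helper function is my own:

```python
import base64


def pdf_to_data_url(pdf_bytes: bytes) -> str:
    """Encode a scanned PDF as a data URL for the OCR endpoint."""
    b64 = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{b64}"


def ocr_pdf(pdf_bytes: bytes, api_key: str) -> str:
    """Run Mistral OCR on a PDF and return the extracted text as markdown."""
    from mistralai import Mistral  # imported lazily; requires `pip install mistralai`
    client = Mistral(api_key=api_key)
    resp = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url",
                  "document_url": pdf_to_data_url(pdf_bytes)},
    )
    # Each page comes back with a `markdown` field.
    return "\n\n".join(page.markdown for page in resp.pages)
```

The resulting markdown can then be fed into whichever embedding or prompt-stuffing path you choose.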

I was trying to figure this out as well. Previously I worked (in custom code) with the Assistants API, which can use multiple files in OpenAI’s built-in vector store (see https://platform.openai.com/docs/assistants/tools/file-search).

For some reason I can’t get it to work: as soon as I try to upload several files in one go, the OpenAI file upload node throws errors. Have you worked something out already?

I was able to get it to work with Supabase, but I’d like a more minimal setup, to be honest.

No, I haven’t figured it out. Even uploading one file at a time, how do you then use them all as a single vector store?

I have a proof of concept working. There are a lot of steps involved, but this flow pulls all attachments from a folder in Google Drive, uploads them to OpenAI, adds them to a vector store, creates an assistant, and finally attaches the vector store to the assistant.

Maybe this helps you a bit further as well…
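For anyone rebuilding this flow, the steps above map onto OpenAI REST calls that each n8n HTTP Request node would send (endpoint paths per the OpenAI API reference). Here is a sketch of the final step, wiring a vector store into a new assistant; the model name and helper functions are my own illustrative choices:

```python
API = "https://api.openai.com/v1"


def assistant_payload(vector_store_id: str, model: str = "gpt-4o-mini") -> dict:
    """Request body for POST /v1/assistants that enables file_search
    and attaches the vector store holding the uploaded files."""
    return {
        "model": model,
        "tools": [{"type": "file_search"}],
        "tool_resources": {
            "file_search": {"vector_store_ids": [vector_store_id]}
        },
    }


def create_assistant(api_key: str, vector_store_id: str) -> str:
    import requests  # imported lazily; requires `pip install requests`
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "assistants=v2",  # required header for the Assistants API
    }
    r = requests.post(f"{API}/assistants", headers=headers,
                      json=assistant_payload(vector_store_id))
    r.raise_for_status()
    return r.json()["id"]
```

The earlier steps are the same pattern: POST /v1/files (multipart, purpose=assistants), POST /v1/vector_stores, then POST /v1/vector_stores/{id}/files per uploaded file.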


Wow, that’s nice, what a great idea. I might have to implement this myself too :slight_smile:

This seems to solve my issue! I’ll have to try it out. Thanks!

