Best Approach to Extract Data from Heterogeneous PDF Invoices Using RAG in n8n
- n8n version: 1.88.0
- Database: Supabase, Postgres
- n8n EXECUTIONS_PROCESS setting: -
- Running n8n: n8n cloud
- Operating system: Windows
Hi everyone,
I’m working on a workflow in n8n to extract structured data from electricity invoices in PDF format. The main challenge is that each energy provider (e.g., Acciona, Feníe, etc.) uses a different layout and structure, so a single prompt or extraction logic doesn’t work well across all invoices. For some providers, the extraction works quite accurately, but for others, the breakdown of prices and quantities creates inconsistencies.
Currently, I use OCR to extract text from the PDFs (OpenAI is handling this part and seems to do it well). The issue arises in the next step, where I pass the text to a RAG-based AI Agent (also using OpenAI). This RAG Agent is responsible for parsing the information into a fixed JSON format, which I use to normalize the output. The resulting structured data is then stored in a Google Sheets template.
However, due to the variety in invoice formats, the agent often struggles to consistently extract the fields I need—especially when units vary, or when field names differ slightly across providers.
I’m wondering:
- What’s the best way to “train” or “feed” the RAG Agent to handle these different invoice types effectively?
- Does it make sense to vectorize and store sample invoices by provider?
- Should I route the extracted text through specific prompts or logic depending on the detected provider?
Any tips, best practices, or examples would be greatly appreciated!
Thanks in advance 
Hi José!
You’re tackling a great use case — handling heterogeneous PDF invoices is a common but complex automation challenge. Here’s a structured approach to improve the performance of your RAG-based workflow in n8n:
Best Practices for Structured Invoice Extraction with RAG
1. Provider Detection First (Routing Logic)
Before sending the extracted text to the AI agent, detect the provider using a small classification prompt, or even a simple regex/keyword-based IF condition.
Why?
Because once you know the provider, you can dynamically:
- Load a specific context
- Choose a tailored prompt
- Route to a specialized sub-agent (if needed)
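As a minimal sketch, keyword-based detection in a Code node could look like this (the provider keyword lists are placeholders; use strings that actually appear on each provider's invoices, and adjust the assumed `ocrText` field name to your workflow):

```javascript
// n8n Code node, 'Run Once for All Items' mode (sketch).
// Assumes the OCR text arrives in an `ocrText` field.
const providers = {
  acciona: ['acciona'],
  fenie: ['feníe', 'fenie energía'],
};

return items.map(item => {
  const text = (item.json.ocrText || '').toLowerCase();
  let provider = 'unknown';
  for (const [name, keywords] of Object.entries(providers)) {
    if (keywords.some(k => text.includes(k))) {
      provider = name;
      break;
    }
  }
  return { json: { ...item.json, provider } };
});
```

Downstream, a Switch node on `provider` can route each invoice to its tailored prompt, with the `unknown` branch falling back to a generic one.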
2. Use Vector Search Only for Few-Shot Examples (not the invoice itself)
Instead of vectorizing the entire invoice content, vectorize 2–3 sample labeled outputs per provider — meaning: example prompt + expected output.
You can use a Supabase vector store or Pinecone and embed these pairs with OpenAI's `text-embedding-ada-002`.
When a new invoice arrives, retrieve the top example(s) from the same provider and include that example as context in your RAG prompt.
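Here is a rough sketch of that retrieval step, written as a standalone Node.js script since n8n Cloud's Code node restricts external imports (inside the workflow you'd typically use the Supabase/OpenAI nodes or HTTP Request nodes instead). The `invoice_examples` table and the `match_invoice_examples` SQL function are assumptions you'd create yourself, e.g. following Supabase's pgvector guide:

```javascript
// Sketch: fetch few-shot examples for the detected provider.
const { createClient } = require('@supabase/supabase-js');
const OpenAI = require('openai');

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getFewShotExamples(ocrText, provider, count = 2) {
  // Embed the incoming invoice with the same model used for the stored examples.
  const emb = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: ocrText.slice(0, 8000), // stay well under the model's token limit
  });

  // Similarity search, restricted to the detected provider.
  // `match_invoice_examples` is a user-defined SQL function (pgvector).
  const { data, error } = await supabase.rpc('match_invoice_examples', {
    query_embedding: emb.data[0].embedding,
    filter_provider: provider,
    match_count: count,
  });
  if (error) throw error;
  return data; // e.g. [{ content: 'example OCR snippet + expected JSON' }, ...]
}
```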
3. Prompt Engineering by Provider
Structure your prompt like this:
You are a data extraction assistant. Extract the following fields: Provider, Invoice Number, Date, Total Amount, kWh Consumed, and Unit Price. Return the result as JSON.
Here is a sample invoice:
[Insert retrieved few-shot example]
Here is the OCR text:
[Insert actual OCR text from this PDF]
Respond with only valid JSON. Do not hallucinate values. If something is missing, return null.
Use `{{provider}}` in the System Prompt and User Prompt, and set it dynamically in your workflow using Set or Function nodes based on routing.
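As one concrete way to wire that up, the final prompt could be assembled in a Code node like this; `provider`, `fewShotExample`, and `ocrText` are assumed field names produced by the upstream routing, retrieval, and OCR steps:

```javascript
// n8n Code node, 'Run Once for Each Item' mode (sketch).
const { provider, fewShotExample, ocrText } = $json;

const prompt = [
  'You are a data extraction assistant. Extract the following fields:',
  'Provider, Invoice Number, Date, Total Amount, kWh Consumed, and Unit Price.',
  'Return the result as JSON.',
  '',
  `Here is a sample ${provider} invoice and its expected output:`,
  fewShotExample,
  '',
  'Here is the OCR text:',
  ocrText,
  '',
  'Respond with only valid JSON. Do not hallucinate values.',
  'If something is missing, return null.',
].join('\n');

return { json: { ...$json, prompt } };
```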
Bonus Tips
- Use a schema validator after the agent (e.g., `ajv` via a Function node) to ensure the structure is valid before inserting into Google Sheets; a sketch follows below.
- Store failed extractions and their raw OCR so you can iteratively improve your RAG context.
- You can run a fallback flow if key fields like `kWh` or `Total` come back as `null`.
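For the validation tip, here is a sketch of the `ajv` check, which also flags rows for the fallback flow. It assumes `ajv` is importable (on self-hosted n8n with external modules allowed, or outside n8n), and the schema fields are illustrative; match them to your Google Sheets template:

```javascript
// Sketch: validate the agent's JSON before writing to Google Sheets.
const Ajv = require('ajv');
const ajv = new Ajv();

const schema = {
  type: 'object',
  properties: {
    provider: { type: 'string' },
    invoiceNumber: { type: 'string' },
    date: { type: 'string' },
    totalAmount: { type: ['number', 'null'] },
    kwhConsumed: { type: ['number', 'null'] },
    unitPrice: { type: ['number', 'null'] },
  },
  required: ['provider', 'invoiceNumber', 'date', 'totalAmount', 'kwhConsumed', 'unitPrice'],
};
const validate = ajv.compile(schema);

const data = JSON.parse($json.agentOutput); // assumed field holding the agent's reply

// Invalid structure or null key fields => route to the fallback branch.
const needsFallback =
  !validate(data) || data.totalAmount === null || data.kwhConsumed === null;

return { json: { ...data, needsFallback, validationErrors: validate.errors ?? null } };
```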
If you want, I can help you set up:
- A generic routing template for provider detection
- Sample few-shot vector DB logic
- An example OpenAI prompt block with variables
Let me know if you’d like that!
Maybe use one agent per invoice format, which you can train, or use an AI assistant that is already trained?
So instead of feeding the same AI agent two different invoice formats, have one for each, which you can then pass into the RAG agent?
That may work better for you.
Hi, thanks for your answer. And yes, I'd appreciate any help you can give me!
I thought about this approach, but I don't know whether I'll have 3, 5 or 100 different providers. That's my concern.
Hi, thanks for your answer. I could try this approach, but I'm worried about the different providers, because I don't know whether there will be 3, 5 or 30.
That's why I was focusing on building something generic, but it's neither accurate nor reliable: it works perfectly with one sample but not with another.
One more thing, guys.
I was wondering what the best way is to teach an agent or an assistant. I haven't tried it because I didn't know how, and I don't know whether it's really a good and useful approach (it sounds good to me, but is it worth it?).
You could try something like this, quoted from another post; I was looking into it on another post yesterday too, and it might be a good option:
"Just to comment on OCR solutions, I’d highly recommend Google Cloud’s DocumentAI offering. I’ve found the service is be fast with consistent, solid results for any type of scan. The only caveat is that they may have differing pricing for forms (not sure!) but otherwise, incredibly good value for money. I wrote a little about my experience here.
Edit: Also for an alternative approach, I also wrote about a similar task parsing invoices using LlamaParse. I found converting PDF tables to markdown tables allowed the LLM to understand structured data more easily."
I think the problem is not so much the OCR as the interpretation the AI Agent does after that.
I'm going to try using different prompts depending on the provider, but I don't know if that's feasible (because those invoices can be so different). My fear is that it can't be done.
Hey Jose_Alapont_Lujan,
I analyzed your problem and it's not that hard. I'll help you: share your workflow, and I'll solve it if you give me your sample database.
Sorry, but I can’t share my workflow, only talk about its concepts. Thanks 
I'm thinking about all this.
The approach based on a different prompt for every provider is not an option.
I noticed that I was converting my PDFs to text and then passing the result to my Agent. Now I'm filtering this text by passing it through a Message Model first. Despite this new approach, the Message Model (I'm testing with GPT-4o, 4.1 and 4.5-preview) is not reliable, because the PDFs are so unstructured.
I'm stuck on this, friends.
Hi all,
I spoke with my client to adjust our process. We can’t make this work accurately unless we simplify the heterogeneous nature of the problem.
So now, we’re going to approach it using the common data available in every PDF. The best results came from using the OpenAI 4.5 node to review each PDF and attempt data normalization. However, despite that, several issues still appeared at different stages.
You can close this post, thanks for your help, guys!
Try Mistral for the OCR part, then Gemini 2.5 Flash (thinking) to extract the fields you're looking for from Mistral's output. You need to give Gemini a prompt that describes each term you're looking for and tells it to pick the closest one it finds in the input. For example, you may have a total field that can be called Amount Due, Total, or Net Amount (description: amount due after all deductions, surcharges, or taxes; example field names: Amount Due, Total, Net Amount; if you fail to find these precise labels, you must pick the one closest in meaning to the description).
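To make that concrete, here is one way to keep those term descriptions as data and fold them into the Gemini prompt (all names, descriptions, and labels here are illustrative):

```javascript
// Sketch: field descriptors for the extraction prompt.
const fields = [
  {
    name: 'total',
    description: 'Amount due after all deductions, surcharges, or taxes.',
    exampleLabels: ['Amount Due', 'Total', 'Net Amount'],
  },
  {
    name: 'kwhConsumed',
    description: 'Total electricity consumed in the billing period, in kWh.',
    exampleLabels: ['kWh Consumed', 'Consumo', 'Energía consumida'],
  },
];

// Build the per-field instructions to paste into the prompt.
const fieldInstructions = fields
  .map(f =>
    `- ${f.name}: ${f.description} Example labels: ${f.exampleLabels.join(', ')}. ` +
    'If none of these exact labels appear, pick the label closest in meaning to the description.'
  )
  .join('\n');
```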
In general, try to offload previously deterministic or hardcoded subtasks and paths to reasoning models and see if they can handle them. Some are smart enough to do the work you're trying to do with the RAG.