Is the AI Agent node too slow for production RAG? Seeing 16s+ response times


Describe the problem/error/question

I’m running a production RAG chatbot using the standard AI Agent + Vector Store Retriever setup and consistently hitting 16-18s response times.

After profiling, the bottleneck isn’t the LLM or the vector DB — it’s the AI Agent node itself. It makes 2-4 internal LLM calls per query (tool selection, reasoning loops, memory handling) before generating the actual answer. For a simple “retrieve context → answer” flow, most of that work is unnecessary.
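To make the overhead concrete, here is a back-of-the-envelope check. The ~4s-per-call figure is an assumption for illustration, not a measurement from this thread:

```javascript
// Rough latency model: the agent's hidden orchestration calls are serial,
// so each extra LLM round trip adds its full latency to the response time.
const perCallSeconds = 4;   // assumed average LLM round-trip latency
const agentCalls = 1 + 3;   // final answer + ~3 hidden orchestration calls
const directCalls = 1;      // single answer call in a direct pipeline

console.log(agentCalls * perCallSeconds);   // 16 — in line with the 16-18s agent path
console.log(directCalls * perCallSeconds);  // 4 — in line with the 3-5s direct path
```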

The actual useful pipeline (embedding + vector search + single LLM call) takes only 3-5s when I bypass the Agent node and use raw HTTP + Code nodes instead.

My questions for the community:

  1. Are you seeing similar latency with the AI Agent node in production RAG setups?

  2. Has anyone found a way to make the AI Agent node faster without bypassing it?

  3. Is the Retrieval QA Chain node faster than the AI Agent for simple RAG? Anyone benchmarked?

  4. For those who ditched the Agent node — what does your pipeline look like?

  5. Does the n8n team have plans to add a lightweight “simple RAG mode” without the reasoning loop?

Workflow A — The slow setup (AI Agent, 16-18s):

AI Agent → Vector Store Retriever (Supabase) → OpenAI/Mistral LLM

Workflow B — The fast alternative (HTTP pipeline, 3-5s):

Embed query (HTTP) → Vector search via Supabase RPC (HTTP) → Build prompt (Code) → Call LLM (HTTP) → Parse response (Code)
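For reference, the "Build prompt (Code)" step can be sketched as a plain Code-node function. The field names (`content`) and the prompt wording are illustrative assumptions, not the author's actual schema:

```javascript
// Hypothetical sketch of the prompt-building Code node: take the rows
// returned by the Supabase RPC vector search and assemble chat messages
// for a single LLM call.
function buildPrompt(question, matches, maxChunks = 4) {
  const context = matches
    .slice(0, maxChunks)
    .map((m, i) => `[${i + 1}] ${m.content}`)
    .join("\n\n");
  return [
    { role: "system", content: "Answer using only the context below.\n\n" + context },
    { role: "user", content: question },
  ];
}

// Example: two retrieved chunks become one system message plus the user question.
const messages = buildPrompt("What is the refund policy?", [
  { content: "Refunds are available within 30 days." },
  { content: "Shipping costs are non-refundable." },
]);
```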

The output returned by the last node

AI Agent approach: correct answers, but 16-18s latency
Raw HTTP approach: same answer quality, 3-5s latency

The difference is entirely due to the hidden LLM calls inside the AI Agent node.

Information on my n8n setup

  • n8n version: 2.9.4

  • Database: PostgreSQL via Supabase

  • Running n8n via: self-hosted instance (Business plan)

Hi @Rodolphe24 :waving_hand:

Really interesting benchmark … thanks for sharing this.

I’ve been seeing something similar, and it matches a point I made in my article:


Applying OOP Principles Inside n8n Code Nodes


My approach is a hybrid one: if I already know the data will be needed, I prepare it before the AI Agent and pass it directly into the prompt. I only rely on the Agent when the tool choice is actually uncertain.

For simple RAG (“retrieve → answer”), this avoids the extra reasoning loops and keeps things much closer to a single LLM call.
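The hybrid routing described above could be sketched roughly like this; the intent patterns and return labels are illustrative assumptions, not from the thread:

```javascript
// Hypothetical router: queries matching a known "retrieve → answer" intent
// skip the Agent and go through the direct pipeline; everything else keeps
// the Agent's full tool-selection loop.
function routeQuery(query, knownIntents) {
  return knownIntents.some((re) => re.test(query)) ? "direct-rag" : "agent";
}

const knownIntents = [/refund/i, /shipping/i]; // assumed deterministic topics
routeQuery("What is your refund policy?", knownIntents);         // "direct-rag"
routeQuery("Compare our plans and draft a reply", knownIntents); // "agent"
```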

Curious if others have compared this with the Retrieval QA Chain or found ways to reduce Agent overhead without bypassing it.

2 Likes

The IBM videos that helped me think about this are “:backhand_index_pointing_right: What is a Vector Database? Powering Semantic Search & AI Applications” and “:backhand_index_pointing_right: Is RAG Still Needed? Choosing the Best Approach for LLMs”. They both point toward the same idea: for simple RAG, a clean retrieve → answer pipeline can be a better fit than a full reasoning loop.

That is why I am also seeing good results with a simpler flow: embedding + vector search + one LLM call.

1 Like

Hello @Haian_Abou-Karam, thanks for your feedback.

What I find frustrating though is the bigger picture: n8n is a low-code platform, and the whole promise of low-code is to ship fast using native nodes. But right now, if you want a RAG chatbot that’s actually production-ready — say, a public-facing assistant on a website — you’re forced to bypass most of the native AI nodes and rebuild everything with HTTP + Code nodes.

That kind of defeats the purpose.

The real challenge isn’t whether we can work around it — clearly we can. It’s that scaling a native n8n AI setup for production use cases shouldn’t require dismantling it. A user who picks the AI Agent + Vector Store Retriever expecting production-grade performance will hit a wall, and there’s no obvious path forward without deep technical workarounds.

I’d love to see n8n bridge that gap — a lightweight RAG mode, or at least an option to skip the reasoning loop when only one tool is attached. That alone would make the native stack viable for real-world, public-facing chatbots.

1 Like

Hello @Rodolphe24

Thanks for the thoughtful reply, I think this is an important discussion for n8n.

I see both sides here:

On one hand, the native AI stack should stay as low-code and production-friendly as possible, especially for common RAG use cases.

On the other hand, for some workflows, a hybrid approach with HTTP/Code nodes is still the most practical way to get the performance and control we need.

So for me, the real question is not “native or custom,” but where the line should be between simplicity, flexibility, and performance. That feels like the key topic here, and it is worth exploring openly.

A lightweight RAG mode, or an option to skip reasoning loops when the flow is already deterministic, could be a very interesting direction. At the same time, the Agent still has a clear place when tool choice is genuinely uncertain.

I think this is a fair and valuable conversation for the n8n community, because it touches the real gap between the low-code promise and production reality.

1 Like

Hi @Rodolphe24
I’ve seen similar latency with the AI Agent in simple RAG setups, and in my experience that usually comes from the agent orchestration itself rather than the vector search or the final LLM call. If the flow is basically retrieve context → answer, I’ve had better results using a Retrieval QA-style setup or a direct HTTP/Code pipeline, and keeping the AI Agent for cases that really need tool selection or multi-step reasoning. I’m not aware of an announced lightweight “simple RAG mode” yet, but I do think that would be a very useful addition for production chatbot use cases.

1 Like

I think this is a very important discussion.

One thing I would add is that RAG should not be the default for every design.

In some cases, if the knowledge base is small enough or the context window is strong enough, a direct prompt with preloaded context can be simpler and more effective.

So I see this as an architectural choice, not a fixed rule:
sometimes RAG is the right solution,
sometimes retrieving data before the Agent is better,
and sometimes long context alone may be enough.
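That choice can even be made mechanically. A rough decision helper, with all thresholds as illustrative assumptions (context size, chars-per-token ratio, and the half-window budget are not from this thread):

```javascript
// If the whole knowledge base fits comfortably within the model's context
// window, preloading it into the prompt may beat a retrieval pipeline.
function fitsInContext(kbChars, contextTokens = 128000, charsPerToken = 4, budget = 0.5) {
  const kbTokens = kbChars / charsPerToken; // crude token estimate
  return kbTokens <= contextTokens * budget; // leave room for the answer
}

fitsInContext(200000);   // true  — ~50k tokens, well under half of 128k
fitsInContext(2000000);  // false — ~500k tokens, retrieval is needed
```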

That is why I think software engineering matters even in low-code. The real value is not just using native nodes, but choosing the right design for the problem.

For me, the interesting question is not only how to make the AI Agent faster, but also when the Agent or RAG is not needed at all.

2 Likes