My RAG Agent does not work

Describe the problem/error/question

Hi everyone,

I’m building a RAG agent in n8n for an indoor cycling company. The setup works but has reliability issues I can’t solve through prompting alone, and I’d appreciate input from anyone who’s built production-grade agents.

Setup:

  • Webhook → AI Agent (OpenAI Chat Model) → Edit Fields → Respond to Webhook

  • Tools available to the agent:

    • Airtable (Search): distributor lookup by country

    • Pinecone: Ride High magazine content (issues 3–27)

    • SearchAPI (Google, site:body-bike.com): products and company info

    • HTTP Request: currently disabled via prompt

What works:

  • Airtable lookups are fast and accurate when country matches

  • Pinecone returns relevant magazine content for thematic questions

The issues:

  1. Tool selection is unreliable. Agent sometimes calls 2–3 tools when one would do, or calls the wrong tool entirely (e.g. Google + Airtable for a pure magazine question). Even with explicit prompt rules (“magazine questions → Pinecone only”), it ignores them.

  2. Hallucinated source URLs and dates. When citing Ride High issues, it invents months that don’t match the actual issue URL. Example: writes “Issue 18 January 2023” when the real URL is december-2022. I’ve tried hardcoded date lists in the prompt and explicit examples — still fails.

  3. Source mismatch. Sometimes a distributor answer (from Airtable) ends with a Ride High magazine URL as the citation. The source link doesn’t match the content of the answer.

  4. Hallucinated geographic coverage. When a country has no distributor in Airtable (correct behavior should be “no local distributor, contact HQ”), the agent suggests a nearby distributor instead and invents claims like “covers Benelux including Germany” — which is false.

  5. Model behavior differs. gpt-4o-mini ignores most rules. gpt-4o follows more but isn’t perfect. Latency on 4o is 15–20 seconds which feels high.

What I’ve tried:

  • Hard rules at the top of the system prompt with examples of correct vs. incorrect output

  • Forbidding HTTP Request explicitly

  • Hardcoding the latest issue number

  • Reducing prompt length

My questions:

  1. Is there a more reliable way to enforce tool selection than prompt rules? (Routing layer before the agent?)

  2. How do people handle deterministic post-processing (e.g. fixing dates/URLs) without adding latency?

  3. Is a validator agent worth it given the latency cost, or is there a better pattern?

  4. Any tips for getting consistent source attribution that matches the actual tool that was called?

  5. I feel like I’m in a whack-a-mole cycle — every time I fix one issue, a new one appears. Fix tool selection → dates break. Fix dates → source URLs break. Fix sources → hallucinated coverage breaks. Is this just how LLM agents work, or am I missing a fundamental approach that would stabilize the whole thing?

Any guidance appreciated — happy to share more details, screenshots, or my system prompt.

Thanks!


Yeah, you're not in a whack-a-mole loop because of bad prompting. You're in one because a single agent with 4 tools just doesn't stabilize on gpt-4o-mini, no matter how much you prompt-engineer it. A couple of things that actually fix this at the structural level:

The citation hallucinations are less an LLM problem than a workflow one: the model is constructing dates/URLs from its own pretrained knowledge instead of reading them off the Pinecone match metadata. The fix is to make citations deterministic in n8n, not generated by the model. After your Pinecone node, drop in a Set node that pulls $json["metadata"]["source_url"] and $json["metadata"]["issue_label"] into explicit fields, then inject the retrieved chunks into the system message in a strict format the model has to copy from:

{
  "retrieved": [
    {
      "citation_url": "https://ridehigh.com/issues/december-2022",
      "issue_label": "December 2022 (Issue 18)",
      "content": "..."
    }
  ],
  "rule": "When citing, copy citation_url and issue_label verbatim from retrieved[]. Do NOT construct URLs or dates yourself. If a field is empty, omit the citation."
}

Once you take the date/URL construction out of the model's hands, the hallucinations largely stop because there's nothing left for it to confabulate.

For tool selection, your instinct is right: drop the 4-tool agent and replace it with a tiny classifier + Switch. One cheap call (gpt-4o-mini, temp 0, output is one word: magazine | distributor | product | general), Switch on the output, then 3 specialist agents (magazine, distributor, product) each with ONE tool, plus a fallback branch for general. Mini becomes reliable again when it only has to pick one thing. Classifier config:
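A rough sketch of that step as Code node logic, in case it helps. The metadata field names (source_url, issue_label, text) and the match shape are assumptions based on the expressions above, so adjust them to whatever your Pinecone node actually returns. The second function is an optional guard that flags any URL in the final answer that was not part of the retrieved set.

```javascript
// Sketch only: field names under `metadata` are assumptions, not a fixed schema.

// Turn Pinecone matches into the strict "retrieved" structure the model copies from.
function buildRetrieved(matches) {
  return matches.map((match) => {
    const meta = match.metadata || {};
    return {
      citation_url: meta.source_url || "", // empty string means "omit the citation"
      issue_label: meta.issue_label || "",
      content: meta.text || "",
    };
  });
}

// Return every URL in the answer that was NOT supplied in retrieved[].
function findUnknownUrls(answer, retrieved) {
  const allowed = new Set(
    retrieved.map((r) => r.citation_url).filter(Boolean)
  );
  const found = answer.match(/https?:\/\/[^\s)"'<>]+/g) || [];
  return found
    .map((url) => url.replace(/[.,;:!?]+$/, "")) // drop trailing punctuation
    .filter((url) => !allowed.has(url));
}
```

Inside an n8n Code node you would call buildRetrieved on something like $input.all().map((item) => item.json). If findUnknownUrls returns anything, you can strip or replace those links in the workflow instead of asking the model to try again.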

{
  "model": "gpt-4o-mini",
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "Classify the user query into exactly one category: magazine | distributor | product | general. Output only the category word, nothing else."
    }
  ]
}
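Even at temperature 0 the classifier can return extra whitespace, punctuation, or two labels at once, so it is probably worth normalizing the output before the Switch node. A minimal sketch, assuming the four category words above and treating anything unexpected as general (the function name is made up for illustration):

```javascript
// Allowed classifier outputs, taken from the category list in the prompt above.
const CATEGORIES = ["magazine", "distributor", "product", "general"];

// Normalize raw classifier text to exactly one known category.
// Anything that is not a clean single label falls back to "general".
function normalizeCategory(raw) {
  const cleaned = String(raw ?? "")
    .trim()
    .toLowerCase()
    .replace(/[^a-z]/g, ""); // strip punctuation, quotes, newlines
  return CATEGORIES.includes(cleaned) ? cleaned : "general";
}
```

The Switch node can then match on exact strings, and an odd classifier response lands in the fallback branch instead of silently matching nothing.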

Also worth flagging: check the Sampling Temperature option on n8n's OpenAI Chat Model node. As far as I can tell it defaults to 0.7 in recent n8n versions (the OpenAI API itself defaults to 1.0), which is too high for tool calling. Drop your main agents to 0.2 and a chunk of the cross-tool confusion goes away on its own.

The geographic coverage hallucination is the model confabulating when Airtable returns nothing. The fix is to NOT have the agent call Airtable as a tool at all. Do the Airtable search in the workflow before the agent runs and pass the result in as context. If it's empty, inject "distributor_status": "none" and give the model a hard rule: “if status is none, respond: no local distributor for this country, please contact HQ”. The workflow gates the answer, the LLM just renders it. The same pattern works for any retrieval where “no result” is a valid outcome.

A validator agent is a band-aid IMO. If you restructure as above, the failure modes get eliminated, not caught. Way better ROI than paying 5s extra per response to chase symptoms you shouldn't be producing in the first place.
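Here is one way the gate could look as Code node logic. It is a sketch under assumptions: the Airtable record fields (name, country) and the wording of the fixed reply are placeholders, not your actual schema. The match is exact on country, so a neighbouring country's distributor never gets passed to the model.

```javascript
// Placeholder wording for the fixed "no distributor" reply.
const NO_DISTRIBUTOR_REPLY =
  "There is no local distributor for this country. Please contact HQ.";

// Build the context object that gets injected into the agent prompt.
// `records` are Airtable rows; the `name` and `country` fields are assumed.
function buildDistributorContext(country, records) {
  const wanted = String(country || "").trim().toLowerCase();
  const matches = records.filter(
    (r) => String(r.country || "").trim().toLowerCase() === wanted
  );
  if (matches.length === 0) {
    // No exact match: the workflow decides the answer, not the model.
    return {
      distributor_status: "none",
      country,
      distributors: [],
      fixed_reply: NO_DISTRIBUTOR_REPLY,
    };
  }
  return {
    distributor_status: "found",
    country,
    distributors: matches.map((r) => ({ name: r.name, country: r.country })),
    fixed_reply: null,
  };
}
```

When distributor_status is none you could even skip the LLM call and return fixed_reply directly from the workflow, which also saves the latency.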

Welcome @Julius_Hessellund to our community! I’m Jay and I am an n8n verified creator.

The tool selection issue is a known weakness with the default agent pattern - letting the model decide which tools to call doesn’t work reliably. A routing layer fixes this: add a first LLM call that classifies the query into categories (distributor_lookup / magazine_content / product_info) and returns structured JSON, then use a Switch node to route to a dedicated branch for each category where only the relevant tool is available. This eliminates wrong-tool calls entirely since each branch has no other options.

For the URL/source mismatch - post-process in a Code node that maps the retrieved document IDs or chunk references back to their canonical source URLs from your own lookup table, instead of trusting the model to cite them correctly. Those two changes together should remove most of the whack-a-mole you’re seeing.
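A minimal sketch of that Code node step, assuming each retrieved chunk carries an issue number and that you maintain the lookup table yourself. The table entry, the example.com URL, and the function name are placeholders for illustration. Sources are attached only on the magazine_content route, so a distributor answer can never end with a magazine link.

```javascript
// Hypothetical lookup table: issue number -> canonical source.
// Fill this from your own data; the URL below is a placeholder.
const ISSUE_SOURCES = new Map([
  [18, { url: "https://example.com/ride-high/december-2022", label: "Issue 18, December 2022" }],
]);

// Attach citations in the workflow based on the route and the chunks
// that were actually retrieved, instead of letting the model write them.
function attachSources(answerText, route, chunkRefs) {
  if (route !== "magazine_content") {
    return { answer: answerText, sources: [] };
  }
  const seen = new Set();
  const sources = [];
  for (const ref of chunkRefs) {
    const entry = ISSUE_SOURCES.get(Number(ref.issue));
    if (entry && !seen.has(entry.url)) {
      seen.add(entry.url); // de-duplicate chunks from the same issue
      sources.push(entry);
    }
  }
  return { answer: answerText, sources };
}
```

Chunks whose issue is missing from the table simply produce no citation, which is safer than a guessed one.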