Help: Optimizing a WhatsApp Restaurant Agent (Gemini Flash) – Solving Hallucinations, Latency, and Double-Texting

MAIN WORKFLOW_SEVERIN PLUS.json (71.9 KB)
My restaurant bot (Severin Plus) is experiencing high latency and “double-text” errors. The current architecture is a linear synchronous flow:

  1. WhatsApp Trigger → Duplicate Filter → HTTP Store Status Check → HTTP Typing Indicator → AI Agent (Gemini 1.5 Flash + 6 Tool sub-workflows) → Send Message.

Main Issues:

  • Webhook Retry Bug: Users often have to send a message twice. I believe the linear flow (multiple HTTP requests and tool calls) exceeds Meta’s 5-second webhook timeout, causing it to retry the message because a 200 OK wasn’t sent fast enough.
  • High Latency: Every sub-workflow tool adds execution overhead. The sequential nature makes the bot too slow for live service.
  • Concurrency/Hallucinations: The AI gets confused with simultaneous users. I suspect Simple Memory (5-message window) is failing to handle concurrent sessions reliably.

Request for Guidance: I want to move to a production-ready Worker/Queue pattern. Specifically, I need advice on:

  • Decoupling Response: How to immediately respond with a 200 OK and handle the AI logic asynchronously in the background.
  • Parallelization: How to run the status check and typing indicator simultaneously.
  • Persistent Memory: Best practices for moving from Simple Memory to Postgres/Redis for high concurrency.

Note: My workflow JSON was too large to paste; I have attached the file to this post. Any help would be appreciated!

Hi @Thoth_AI Welcome!

You are right, just make sure not to exceed Meta’s limits.

I have reviewed your workflow: please avoid connecting MULTIPLE sub-workflows to a single AI Agent node. That change alone would really help decrease the overall latency of your workflow.

Again, this ties into how you leverage AI in your workflow. As I can see, it currently uses AI at a really large scale and calls a lot of different flows, which might not cause hallucinations in initial runs but will cause them in production once the bot is used extensively. Please also consider a proper database like Supabase, which is a genuinely scalable solution.

How do I ensure so many sub-workflows are not connected to a single AI agent?

What strategy can I use to get the best efficiency?

@Thoth_AI Consider dividing your sub-workflows across multiple AI agents, so that each agent handles a different type of sub-workflow; this ensures a proper division of tasks across agents. If possible, reducing the number of sub-workflows is also a good move. You can also use this:

If you have tasks where you just need another AI, use a sub AI Agent node. This approach ensures a proper division of tasks, and it reduces the overhead on any single AI agent, which lowers hallucination risks in production.

Session ID is usually the real culprit, not the model. If sessionId isn’t scoped to the WhatsApp number, Simple Memory bleeds context across users. Just use `{{ $json.from }}` as your sessionId and it fixes it straight away.

For double-texting: send the 200 OK back to the webhook immediately, before the agent even starts processing. Meta stops retrying as soon as it gets the 200.
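In n8n that usually means setting the Webhook node’s Respond option to respond immediately (or putting a Respond to Webhook node before the agent). If you ever move the worker outside n8n, the same pattern in plain TypeScript looks roughly like this sketch (Express-style; `handleWithAgent` is an illustrative placeholder, not something from the attached workflow):

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/whatsapp-webhook", (req, res) => {
  // Ack first: Meta sees the 200 within milliseconds and never retries.
  res.sendStatus(200);

  // Then hand the payload to the slow path without awaiting it.
  handleWithAgent(req.body).catch((err) => console.error("agent failed:", err));
});

// Placeholder for the dedup check, typing indicator, AI agent call, and reply.
async function handleWithAgent(payload: unknown): Promise<void> {
  // ...
}

app.listen(3000);
```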

Three production patterns that address all three issues in the title:

**1. Double-texting — deduplication is the missing piece**

@Pavel_Kuzko is right that sending 200 OK immediately stops Meta’s 5-second timeout from retrying. But if your processing ever fails after the 200, the retry will carry the same message_id — and without a dedup layer you’ll process it twice.

The robust fix: DB-level deduplication keyed on message_id. Early in your workflow, before any AI call:

```sql
INSERT INTO message_log (message_id, chat_id, received_at)
VALUES ($1, $2, NOW())
ON CONFLICT (message_id) DO NOTHING
```

Check `rowsAffected`. If `0` → this `message_id` was already processed → exit immediately. If `1` → new message → continue. The `ON CONFLICT DO NOTHING` is atomic, so even if two webhook retries arrive within milliseconds of each other, only one proceeds.
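If you run this gate from a Code node or an external worker instead of a Postgres node, a minimal TypeScript sketch of the same check (using node-postgres, with the table and columns from the snippet above) would be:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // reads connection settings from PG* env vars

// Returns true if this message_id is new and the workflow should continue.
async function claimMessage(messageId: string, chatId: string): Promise<boolean> {
  const result = await pool.query(
    `INSERT INTO message_log (message_id, chat_id, received_at)
     VALUES ($1, $2, NOW())
     ON CONFLICT (message_id) DO NOTHING`,
    [messageId, chatId]
  );
  return result.rowCount === 1; // 0 means a retry we already handled -> exit
}
```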

**2. Hallucinations — system prompt guardrails that actually work**

The most reliable pattern is a hard forbidden-phrases list at the top of your system prompt, above everything else:

```
ABSOLUTE GUARDRAIL — NEVER say these unless you are transferring to a human:
- "I'll check and get back to you"
- "Our team will contact you"
- "I'll look into that"
- Any phrase that implies future action you won't take

If you cannot answer right now with available information, say so directly and ask a clarifying question. Do NOT promise follow-up.
```

The key is making the guardrail structural: tie it to a handoff tag. Tell the AI: "if you cannot help, output `{handoff}` in your reply." Then your workflow strips the tag, notifies the human, and the AI never lies to the customer about what it can do.
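A minimal sketch of that routing step in TypeScript (`notifyHuman` and `sendWhatsAppReply` are illustrative placeholders for your own alerting and send nodes):

```typescript
// Placeholders for your own alerting and WhatsApp send steps.
async function notifyHuman(chatId: string, context: string): Promise<void> { /* e.g. Slack/email alert */ }
async function sendWhatsAppReply(chatId: string, text: string): Promise<void> { /* WhatsApp send call */ }

// Strip the {handoff} tag before the customer ever sees it.
async function routeReply(chatId: string, aiReply: string): Promise<void> {
  if (aiReply.includes("{handoff}")) {
    await notifyHuman(chatId, aiReply); // staff gets the full, untouched reply
    await sendWhatsAppReply(chatId, aiReply.replace("{handoff}", "").trim());
  } else {
    await sendWhatsAppReply(chatId, aiReply);
  }
}
```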

Context bleed is the other hallucination trigger. If sessionId isn't scoped tightly per user (e.g. `{{ $json.from }}_v1`), Simple Memory mixes up customer contexts and the AI confidently answers with the wrong customer's data. Bumping the version suffix (`_v2`, `_v3`) whenever you change the prompt is also a clean way to wipe stale context for everyone.
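For illustration, the scoped, versioned session key as a tiny helper (`MEMORY_VERSION` is an assumed constant you would bump on prompt changes):

```typescript
const MEMORY_VERSION = "v2"; // bump this whenever you change the system prompt

// msg.from is the sender's WhatsApp number ({{ $json.from }} in n8n terms).
function sessionKey(msg: { from: string }): string {
  return `${msg.from}_${MEMORY_VERSION}`; // e.g. "4915551234567_v2"
}
```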

**3. Latency — buffer rapid-fire messages, hit AI once**

With 6 tool sub-workflows, the real killer isn't the tools themselves — it's the AI being triggered 3× for "ok / sounds good / what time?" sent as three separate messages. Buffer pattern:

- INSERT each message into a `msg_buffer` table (`chat_id`, `content`, `inserted_at`)
- Wait node: 8–10 seconds
- SELECT all buffered messages for that `chat_id` → concatenate → DELETE buffer → send one combined message to the AI
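A rough end-to-end sketch of that buffer in TypeScript with node-postgres (`msg_buffer` columns as in the list above; the 9-second sleep stands in for the Wait node). One assumption flagged: it folds the SELECT and DELETE into a single atomic `DELETE ... RETURNING`, so two concurrent waits can never both walk away with the same batch:

```typescript
import { Pool } from "pg";

const pool = new Pool();
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Buffer one incoming message, wait out the burst, then drain the batch.
// Returns the combined text for a single AI call, or null if another run already drained it.
async function bufferAndCollect(chatId: string, content: string): Promise<string | null> {
  await pool.query(
    "INSERT INTO msg_buffer (chat_id, content, inserted_at) VALUES ($1, $2, NOW())",
    [chatId, content]
  );

  await sleep(9_000); // the Wait node's 8-10 s window

  // Drain and read atomically; concurrent runs can't double-send.
  const drained = await pool.query(
    "DELETE FROM msg_buffer WHERE chat_id = $1 RETURNING content, inserted_at",
    [chatId]
  );
  if (drained.rowCount === 0) return null; // an earlier run took the whole batch

  return drained.rows
    .sort((a, b) => a.inserted_at.getTime() - b.inserted_at.getTime())
    .map((row) => row.content)
    .join("\n");
}
```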

One AI call + one tool fan-out beats three parallel AI calls every time. The perceived latency actually drops because the user gets one coherent response instead of three partial ones firing out of order.