Hey everyone! Sharing something I’ve been running in production for a few months — a WhatsApp AI agent built entirely in n8n.
What it does:
- Receives WhatsApp messages via webhook and buffers them (10s window) to handle rapid-fire messages as one unit
- Runs an AI agent (Claude 3.5 Haiku via OpenRouter) with a business-specific system prompt
- Detects urgency via keyword regex OR AI tagging ({handoff}), which triggers a team notification with full conversation context
- Human takeover mode: team types “unlock” to re-enable the bot after handling
- Follow-up sequences: D+1 and D+3 automatic follow-ups via cron (suppressed if human mode active, opted out, or booking confirmed)
- Automatic fallback: if OpenRouter goes down, switches to Anthropic direct API seamlessly
- Google Calendar integration for real-time appointment scheduling
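For anyone who wants to see two of the routing decisions from the list above spelled out, here is a rough Python sketch. It is not the template's actual logic (that lives in native n8n nodes). The keyword list, the function names and the "days since last contact" input are all assumptions made for illustration; only the {handoff} tag, the D+1/D+3 schedule and the three suppression conditions come from the post.

```python
import re

# Hypothetical keyword list; the template's real regex is not shown in this thread.
URGENT_PATTERN = re.compile(r"\b(urgent|emergency|severe pain|bleeding)\b", re.IGNORECASE)
# Tag named in the post; assumed here to appear verbatim in the model's reply.
HANDOFF_TAG = "{handoff}"


def needs_handoff(user_text: str, ai_reply: str) -> bool:
    """True if the customer's text matches an urgency keyword OR the model tagged its reply."""
    return bool(URGENT_PATTERN.search(user_text)) or HANDOFF_TAG in ai_reply


def followup_due(days_since_contact: int, human_mode: bool,
                 opted_out: bool, booking_confirmed: bool) -> bool:
    """True on D+1 or D+3 unless one of the three suppression conditions holds."""
    if human_mode or opted_out or booking_confirmed:
        return False
    return days_since_contact in (1, 3)
```

Either signal alone is enough to trigger the handoff, so a missed tag from the model can still be caught by the regex and vice versa.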
Architecture highlights:
- Message buffer in PostgreSQL (INSERT → wait 10s → SELECT → DELETE → AI processes aggregated text)
- FOR UPDATE SKIP LOCKED prevents duplicate AI responses on concurrent webhooks
- Error monitoring workflow catches DNS/connection failures and alerts via WhatsApp
- All credentials via n8n credential manager — no hardcoded keys
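A minimal sketch of how the SELECT → DELETE step of the buffer could be written as a single Postgres statement. Only the table name wa_msg_buffer comes from this thread. The columns (id, chat_id, body, created_at), the psycopg-style %(chat_id)s placeholder and the aggregate helper are assumptions; in n8n the same query would go into a Postgres node with its own parameter syntax.

```python
from datetime import datetime
from typing import Iterable, Tuple

# PostgreSQL statement. Column names are assumed, not taken from the template.
# Rows locked by another execution are skipped, so two webhooks firing at once
# cannot both claim (and answer) the same buffered messages.
CLAIM_BUFFERED_SQL = """
WITH claimed AS (
    SELECT id
    FROM wa_msg_buffer
    WHERE chat_id = %(chat_id)s
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
)
DELETE FROM wa_msg_buffer AS b
USING claimed
WHERE b.id = claimed.id
RETURNING b.body, b.created_at;
"""


def aggregate(rows: Iterable[Tuple[str, datetime]]) -> str:
    """Join claimed (body, created_at) rows into one prompt, oldest message first.

    RETURNING does not guarantee order, so the rows are sorted here.
    """
    ordered = sorted(rows, key=lambda row: row[1])
    return "\n".join(body.strip() for body, _ in ordered if body.strip())
```

An execution that gets zero rows back should simply stop without calling the AI, since another execution has already claimed that window.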
Demo video — dental clinic use case (our pilot client), but the template works for any service business.
I packaged this as a ready-to-import n8n template with full setup guide. Available at zapproai.com — Core ($297) and Pro ($497, adds scheduling + follow-up sequences).
Happy to answer questions about the architecture!
Good catch — that’s actually the weakest point in the current setup, being honest.
Right now: no TTL, no dead-letter. If n8n crashes between INSERT and DELETE, rows sit in wa_msg_buffer indefinitely. The mitigation is a monitoring workflow (separate cron) that scans for messages older than 15 minutes and fires a WhatsApp alert to the team. Cleanup is then manual via a quick DELETE query.
It works fine at 1-4 client scale but I documented it as a known gap for when you scale past ~5 concurrent clients. The proper fix is a created_at column with an automated cleanup cron: delete anything older than 2 minutes that wasn’t processed (implying the webhook died mid-window). A dead-letter table would be cleaner but adds complexity that doesn’t justify itself until you have meaningful volume.
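To make the "proper fix" concrete, a sketch under stated assumptions: the SQL targets PostgreSQL, the column name created_at and the 2-minute threshold are the ones mentioned above, and the constant and function names are made up for illustration. The small is_stale helper mirrors the SQL predicate so the rule can be checked without a database.

```python
from datetime import datetime, timedelta, timezone

# One-time migration: every buffered row gets an insertion timestamp.
ADD_CREATED_AT_SQL = """
ALTER TABLE wa_msg_buffer
    ADD COLUMN IF NOT EXISTS created_at timestamptz NOT NULL DEFAULT now();
"""

# Statement a cleanup cron could run: anything older than the threshold is
# assumed to belong to a webhook execution that died mid-window.
CLEANUP_SQL = """
DELETE FROM wa_msg_buffer
WHERE created_at < now() - interval '2 minutes';
"""

MAX_AGE = timedelta(minutes=2)


def is_stale(created_at: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Same rule as CLEANUP_SQL: strictly older than max_age counts as stale."""
    return now - created_at > max_age
```

The threshold needs to stay comfortably above the 10s buffer window plus the usual AI response time, otherwise the cron could delete rows that a healthy execution is about to process.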
The FOR UPDATE SKIP LOCKED pattern actually helps here too: if the worker crashes after SELECT but before DELETE, Postgres releases the row lock when the transaction aborts or the session ends, and the next execution picks the row up. Postgres handles that part cleanly.
Update: adding an automated cleanup cron to the monitoring workflow this week — will post the fix here when it’s live.
Hey Benjamin!
We implemented the stuck message detection and auto-cleanup feature you asked about.
The Monitor Central workflow now includes a parallel branch that runs every 30 minutes:
- CHECK_STUCK_BUFFER: queries wa_msg_buffer for messages older than 5 minutes that weren’t processed
- Auto-cleanup: uses FOR UPDATE SKIP LOCKED (in a CTE) to safely delete stuck rows without interfering with active executions
- Alert: sends a WhatsApp notification with the count and affected chat IDs whenever stuck messages are found and cleaned
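For readers who want the shape of the cleanup branch without opening the workflow, here is an illustrative sketch. It is not copied from ZapPro_Monitor_Central.json. The column names (id, chat_id, created_at), the alert wording and the format_alert helper are assumptions; the 5-minute threshold and the CTE with FOR UPDATE SKIP LOCKED are as described above.

```python
from typing import List, Optional

# PostgreSQL statement: lock stuck rows (skipping any row an active execution
# currently holds), delete them, and return the affected chat IDs for the alert.
STUCK_CLEANUP_SQL = """
WITH stuck AS (
    SELECT id
    FROM wa_msg_buffer
    WHERE created_at < now() - interval '5 minutes'
    FOR UPDATE SKIP LOCKED
)
DELETE FROM wa_msg_buffer AS b
USING stuck
WHERE b.id = stuck.id
RETURNING b.chat_id;
"""


def format_alert(chat_ids: List[str]) -> Optional[str]:
    """Build the alert text from the RETURNING rows, or None if nothing was cleaned."""
    if not chat_ids:
        return None
    unique = sorted(set(chat_ids))
    return (
        f"Buffer cleanup: removed {len(chat_ids)} stuck message(s) "
        f"from chats: {', '.join(unique)}"
    )
```

Returning None for an empty result maps onto the IF node in the workflow: no stuck rows, no WhatsApp message.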
This is already included in the Pro template (ZapPro_Monitor_Central.json) — 15 nodes, all in English with placeholder configs.
The workflow now has 14 nodes total (up from 7) and handles both error monitoring AND buffer health in a single cron cycle. No Code node used — all native n8n nodes (Set, IF, Postgres, HTTP Request) to avoid VM2 sandbox timeout issues on some hosting providers.
Let me know if you have any questions on the setup!