Four days ago our production n8n stack silently dropped leads for 27 hours. The SQLite layer underneath n8n failed, but the /healthz endpoint returned 200 the entire time. Four webhook callers swallowed the error with an empty .catch(() => {}). Zero log lines. Zero alerts. We found out when a prospect mentioned they’d submitted a form two days earlier and never heard back.

That incident forced us to ask: what actually happens in our agent workflows when something fails? For most of them, the answer was “nothing you’d ever notice.”
We audited every workflow. The results were predictable in hindsight:
- 14 had no error handling whatsoever
- 8 had retry logic that treated a 429 rate limit identically to a 401 auth failure
- All of them would silently die on a large tool output (50KB+ web scrape, big DB dump)
- None had any permission layer — agents had full access to every connected tool, always
So we built what we needed. Then open-sourced it.
AgentGuard — three importable sub-workflows, MIT, zero dependencies:
RetryClassifier: 10-class error taxonomy with specific recovery per class (eight rows shown):
| Error Class | Recovery Strategy |
|---|---|
| 429 Rate Limit | Parse Retry-After header → wait exact duration |
| 529 Server Overload | Track consecutive failures → switch model after 3 |
| 400 Context Overflow | Trigger compaction → retry |
| 401/403 Auth Failure | Refresh token → retry once only |
| Network Error | Disable keep-alive → fresh connection |
| Quota Exceeded | Permanent model downgrade |
| Streaming Stall | Abort → retry non-streaming |
| Input Validation | Log + fail (no retry — fix the input) |
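
To make the idea of per-class recovery concrete, here is a minimal sketch of the three rows that can be decided from the status code alone (429, 529, 401/403). It is not AgentGuard's actual code: the `FailedCall` shape, the `Recovery` type, and the function names are hypothetical, and falling back to generic backoff when `Retry-After` is missing or unparseable is an assumption.

```typescript
// Hypothetical types and names; an illustration, not AgentGuard's implementation.
type Recovery =
  | { kind: "wait"; ms: number }               // honor Retry-After exactly
  | { kind: "switchModel" }                    // too many overloads in a row
  | { kind: "refreshAuth"; maxRetries: 1 }     // refresh token, retry once only
  | { kind: "backoff" };                       // generic exponential backoff

interface FailedCall {
  status: number;                // HTTP status code of the failed call
  retryAfter?: string;           // raw Retry-After header, if present
  consecutiveOverloads?: number; // 529s seen in a row, including this one
}

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfterMs(header: string, now: number = Date.now()): number | null {
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

function classify(call: FailedCall): Recovery {
  switch (call.status) {
    case 429: {
      const ms = call.retryAfter ? parseRetryAfterMs(call.retryAfter) : null;
      // Assumption: without a usable header, fall back to generic backoff.
      return ms !== null ? { kind: "wait", ms } : { kind: "backoff" };
    }
    case 529:
      return (call.consecutiveOverloads ?? 1) >= 3
        ? { kind: "switchModel" }
        : { kind: "backoff" };
    case 401:
    case 403:
      return { kind: "refreshAuth", maxRetries: 1 };
    default:
      return { kind: "backoff" };
  }
}
```

The point of returning a tagged `Recovery` value instead of retrying inline is that a 429 and a 401 can no longer share a code path by accident, which was the failure our audit found in 8 workflows.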
Backoff: min(500ms × 2^attempt, 32s) + jitter. When your primary API is down, it cascades to a local Ollama model at $0/call.
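
The backoff formula can be sketched as follows. The 500 ms base and 32 s cap come from the formula above; the jitter size (up to 25% of the capped delay) and the injectable `rng` parameter are assumptions made for illustration, since the formula only says "+ jitter".

```typescript
const BASE_MS = 500;
const CAP_MS = 32_000;

// jitterFraction and rng are hypothetical parameters; rng is injectable
// so the function can be tested deterministically.
function backoffMs(
  attempt: number,
  jitterFraction = 0.25,
  rng: () => number = Math.random,
): number {
  // min(500ms × 2^attempt, 32s): the cap is reached at attempt 6.
  const exponential = Math.min(BASE_MS * 2 ** attempt, CAP_MS);
  // Jitter is added after the cap, matching the formula's order.
  const jitter = exponential * jitterFraction * rng();
  return Math.round(exponential + jitter);
}
```

With jitter disabled (`rng = () => 0`), attempts 0 through 6 give 500, 1000, 2000, 4000, 8000, 16000, and 32000 ms, and every later attempt stays at 32000 ms.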
ContextBudget — two-tier compaction for agent loops:
Two ways context windows die silently: (1) one large tool result consumes the whole window in a single turn, (2) 20 turns of accumulated history exhaust it.
- Tier 1: truncate oversized tool results to head+tail preview + file reference. Free, runs every turn.
- Tier 2: summarize old turns with the cheapest available model (~$0.001). Fires only when needed.
- Circuit breaker prevents infinite compaction loops.
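
A rough sketch of Tier 1 and the circuit breaker, under stated assumptions: the character limits, the `fileRef` argument, the marker text, and the `maxPasses` default are all hypothetical, and sizes are measured in characters rather than tokens to keep the example self-contained.

```typescript
interface Preview {
  text: string;       // what actually goes into the model's context
  truncated: boolean;
}

// Tier 1: keep the head and tail of an oversized tool result and point
// to where the full output was saved. All limits here are made-up defaults.
function previewToolResult(
  output: string,
  fileRef: string,
  maxChars = 8_000,
  headChars = 3_000,
  tailChars = 1_000,
): Preview {
  if (output.length <= maxChars) return { text: output, truncated: false };
  const omitted = output.length - headChars - tailChars;
  const text =
    output.slice(0, headChars) +
    `\n[... ${omitted} characters omitted; full output saved to ${fileRef} ...]\n` +
    output.slice(output.length - tailChars);
  return { text, truncated: true };
}

// Circuit breaker: stop after maxPasses compaction attempts even if the
// context is still over budget, instead of looping forever.
function compactUntilFits(
  size: number,
  budget: number,
  compactOnce: (size: number) => number,
  maxPasses = 3,
): { size: number; passes: number; tripped: boolean } {
  let passes = 0;
  while (size > budget && passes < maxPasses) {
    size = compactOnce(size);
    passes += 1;
  }
  return { size, passes, tripped: size > budget };
}
```

When `tripped` comes back true, the caller should fail loudly rather than retry, since another compaction pass is unlikely to help.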
PermissionGate — glob-pattern allow/deny on every tool call before execution:
{ "tool": "bash", "pattern": "rm -rf*", "action": "deny" }
{ "tool": "bash", "pattern": "ls *", "action": "allow" }
{ "tool": "database", "pattern": "DROP ", "action": "deny" }
{ "tool": "git", "pattern": "push", "action": "prompt" }
Three modes: allow (execute + log), deny (block + return error to model), prompt (hold + notify operator + wait for approval). Every decision logged to PostgreSQL, webhook, or file.
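
Here is one way rule evaluation could look. This is a sketch, not AgentGuard's matcher, and it makes three labeled assumptions: patterns are matched against the whole input with `*` as the only wildcard (so the `DROP ` and `push` rules are written as `DROP *` and `push*` here), the most restrictive matching rule wins (deny over prompt over allow), and a call that matches no rule is denied.

```typescript
type Action = "allow" | "deny" | "prompt";

interface Rule {
  tool: string;
  pattern: string;
  action: Action;
}

// Turn a glob into an anchored RegExp; `*` matches any run of characters.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except *
    .replace(/\*/g, "[\\s\\S]*");          // * also spans newlines
  return new RegExp(`^${escaped}$`);
}

// Assumption: when several rules match, the most restrictive one wins.
const SEVERITY: Record<Action, number> = { allow: 0, prompt: 1, deny: 2 };

function decide(
  rules: Rule[],
  tool: string,
  input: string,
  fallback: Action = "deny", // assumption: default-deny when nothing matches
): Action {
  const matches = rules.filter(
    (r) => r.tool === tool && globToRegExp(r.pattern).test(input),
  );
  if (matches.length === 0) return fallback;
  return matches.reduce<Action>(
    (worst, r) => (SEVERITY[r.action] > SEVERITY[worst] ? r.action : worst),
    "allow",
  );
}

const exampleRules: Rule[] = [
  { tool: "bash", pattern: "rm -rf*", action: "deny" },
  { tool: "bash", pattern: "ls *", action: "allow" },
  { tool: "database", pattern: "DROP *", action: "deny" },
  { tool: "git", pattern: "push*", action: "prompt" },
];
```

Under these assumptions, `decide(exampleRules, "git", "push origin main")` returns `"prompt"`, and an unlisted command such as `curl example.com` on the `bash` tool is denied by the fallback.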
How to use:
- Download workflow JSONs from GitHub
- Import into n8n (Settings → Import Workflow)
- Add an Execute Workflow node to your agent workflow
- Configure (fallback model, context threshold, permission rules)
Each component is standalone — use one, two, or all three. Works with any LLM provider. n8n 1.70+.
Question for the community: What failure modes have you hit in production that these don’t cover? We're specifically curious about multi-agent patterns, webhook-triggered agents, and anything involving file system tools. We want to add recovery strategies for patterns we haven’t seen yet.