Open-sourced: three n8n sub-workflows for agent reliability — RetryClassifier, ContextBudget, PermissionGate [MIT, free]

Four days ago our production n8n stack silently dropped leads for 27 hours. The SQLite layer underneath n8n failed. The /healthz endpoint returned 200 the entire time. Four webhook callers swallowed the errors with an empty .catch(() => {}). Zero log lines. Zero alerts. We found out when a prospect mentioned they’d submitted a form two days earlier and never heard back. That incident forced us to ask: what actually happens in our agent workflows when something fails? For most of them, the answer was “nothing you’d ever notice.”

We audited every workflow. The results were predictable in hindsight:

  • 14 had no error handling whatsoever
  • 8 had retry logic that treated a 429 rate limit identically to a 401 auth failure
  • All of them would silently die on a large tool output (50KB+ web scrape, big DB dump)
  • None had any permission layer — agents had full access to every connected tool, always

So we built what we needed. Then open-sourced it.


AgentGuard — three importable sub-workflows, MIT, zero dependencies:

RetryClassifier — 10-class error taxonomy with specific recovery per class:

Error Class | Recovery Strategy
----------- | -----------------
429 Rate Limit | Parse Retry-After header → wait exact duration
529 Server Overload | Track consecutive → switch model after 3
400 Context Overflow | Trigger compaction → retry
401/403 Auth Failure | Refresh token → retry once only
Network Error | Disable keep-alive → fresh connection
Quota Exceeded | Permanent model downgrade
Streaming Stall | Abort → retry non-streaming
Input Validation | Log + fail (no retry, fix the input)
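
To make the taxonomy concrete, here is a minimal sketch of how a classifier over some of these classes could look in an n8n Code node. It is not the code shipped in AgentGuard: the function name `classifyError`, the input fields (`status`, `code`, `message`), and the returned shape are all assumptions for illustration. Quota and streaming-stall detection are left out because they depend on the provider.

```javascript
// Hypothetical sketch, not AgentGuard's implementation.
// Assumes the error object carries an HTTP `status`, a Node-style `code`,
// and a free-text `message`; real provider errors may look different.
function classifyError(err) {
  const status = err.status;
  const message = String(err.message || "").toLowerCase();
  const networkCodes = ["ECONNRESET", "ETIMEDOUT", "ECONNREFUSED"];

  if (status === 429) {
    return { errorClass: "rate_limit", retry: true, recovery: "wait for Retry-After" };
  }
  if (status === 529) {
    return { errorClass: "server_overload", retry: true, recovery: "switch model after 3 in a row" };
  }
  if (status === 401 || status === 403) {
    // Auth failures get exactly one retry, after a token refresh.
    return { errorClass: "auth_failure", retry: true, maxRetries: 1, recovery: "refresh token" };
  }
  if (status === 400 && message.includes("context")) {
    // Assumption: the provider mentions "context" in overflow messages.
    return { errorClass: "context_overflow", retry: true, recovery: "compact, then retry" };
  }
  if (status === 400) {
    // Any other 400 is treated as bad input: log and fail, never retry.
    return { errorClass: "input_validation", retry: false, recovery: "log and fail" };
  }
  if (networkCodes.includes(err.code)) {
    return { errorClass: "network_error", retry: true, recovery: "fresh connection, no keep-alive" };
  }
  return { errorClass: "unknown", retry: false, recovery: "log and fail" };
}
```

The point of the sketch is the one from the audit: a 429 and a 401 come back with different retry policies instead of sharing one generic retry loop.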

Backoff: min(500 ms × 2^attempt, 32 s) + jitter. When the primary API is down, RetryClassifier cascades to a local Ollama model at $0 per call.
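
The backoff formula and the Retry-After handling can be sketched as two small functions. This is an illustration under assumptions, not the shipped code: the names `backoffDelayMs` and `parseRetryAfterMs` are made up here, and the post does not say how much jitter is added, so the sketch assumes a uniform random amount of up to 25% of the base delay.

```javascript
// Hypothetical sketch of min(500 ms × 2^attempt, 32 s) + jitter.
// `attempt` starts at 0. The 25% jitter ceiling is an assumption.
function backoffDelayMs(attempt, random = Math.random) {
  const base = Math.min(500 * 2 ** attempt, 32000);
  const jitter = random() * 0.25 * base;
  return Math.round(base + jitter);
}

// Retry-After is either a number of seconds or an HTTP date.
// Returns milliseconds to wait, or null if the header is missing or unparseable.
function parseRetryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null) return null;
  const text = String(headerValue).trim();
  const seconds = Number(text);
  if (text !== "" && Number.isFinite(seconds)) {
    return Math.max(0, seconds * 1000);
  }
  const date = Date.parse(text);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}
```

With jitter switched off (`random` returning 0), attempts 0, 1, 2, 3 wait 500, 1000, 2000, and 4000 ms, and the cap of 32 s is reached at attempt 6. For a 429, the parsed Retry-After value would take priority over the computed backoff.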

ContextBudget — two-tier compaction for agent loops:

Two ways context windows die silently: (1) one large tool result consumes the whole window in a single turn, (2) 20 turns of accumulated history exhausts it.

  • Tier 1: truncate oversized tool results to head+tail preview + file reference. Free, runs every turn.
  • Tier 2: summarize old turns with cheapest available model (~$0.001). Fires only when needed.
  • Circuit breaker prevents infinite compaction loops.
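
As a rough sketch of Tier 1 and the circuit breaker described above: the function names, the character limits, and the `ref` placeholder are hypothetical, chosen only to show the shape of the idea. Tier 2 (summarization) is not shown because it needs a model call.

```javascript
// Hypothetical Tier 1 sketch: keep the head and tail of an oversized tool
// result and point at where the full output lives.
// Assumes maxChars >= 2 * previewChars so head and tail never overlap.
function truncateToolResult(text, options = {}) {
  const { maxChars = 8000, previewChars = 2000, ref = "(stored separately)" } = options;
  if (text.length <= maxChars) return text;
  const omitted = text.length - 2 * previewChars;
  return [
    text.slice(0, previewChars),
    `\n[... ${omitted} characters omitted, full output: ${ref} ...]\n`,
    text.slice(-previewChars),
  ].join("");
}

// Hypothetical circuit breaker: after `maxAttempts` consecutive compaction
// passes that still leave the context over budget, stop compacting.
function createCompactionBreaker(maxAttempts = 3) {
  let failures = 0;
  return {
    canCompact: () => failures < maxAttempts,
    record(underBudget) {
      failures = underBudget ? 0 : failures + 1;
    },
  };
}
```

An agent loop would call `truncateToolResult` on every tool output, and before each Tier 2 pass it would check `canCompact()` and report the outcome with `record(...)`. Once the breaker trips, the loop should fail loudly instead of compacting forever.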

PermissionGate — glob-pattern allow/deny on every tool call before execution:

{ "tool": "bash", "pattern": "rm -rf*", "action": "deny" }
{ "tool": "bash", "pattern": "ls *", "action": "allow" }
{ "tool": "database", "pattern": "DROP ", "action": "deny" }
{ "tool": "git", "pattern": "push", "action": "prompt" }

Three modes: allow (execute + log), deny (block + return error to model), prompt (hold + notify operator + wait for approval). Every decision logged to PostgreSQL, webhook, or file.
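
Here is a minimal sketch of how rules like the four above could be evaluated. Several things in it are assumptions, since the post does not spell out the matching semantics: `*` is treated as "any characters", patterns are anchored at the start of the input only, the first matching rule wins, and a call that matches no rule is denied. The names `globToRegExp` and `evaluateToolCall` are hypothetical.

```javascript
// Hypothetical sketch, not AgentGuard's matcher.
// Turns a glob into a RegExp anchored at the start of the input only,
// so "push" matches "push origin main" (assumed prefix semantics).
function globToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except *
    .replace(/\*/g, ".*");                 // glob * becomes "any characters"
  return new RegExp("^" + escaped, "s");
}

// Returns the action of the first rule whose tool and pattern match the call.
// Assumption: no match means deny.
function evaluateToolCall(rules, call, defaultAction = "deny") {
  for (const rule of rules) {
    if (rule.tool === call.tool && globToRegExp(rule.pattern).test(call.input)) {
      return { action: rule.action, rule };
    }
  }
  return { action: defaultAction, rule: null };
}

const rules = [
  { tool: "bash", pattern: "rm -rf*", action: "deny" },
  { tool: "bash", pattern: "ls *", action: "allow" },
  { tool: "database", pattern: "DROP ", action: "deny" },
  { tool: "git", pattern: "push", action: "prompt" },
];
```

Under these assumptions, `evaluateToolCall(rules, { tool: "git", input: "push origin main" })` resolves to `prompt`, and an unlisted command such as `cat notes.txt` on the `bash` tool falls through to the default `deny`. Matching here is case-sensitive, so a lowercase `drop table` would not hit the `DROP ` rule. A real gate would need to decide that explicitly.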


How to use:

  1. Download workflow JSONs from GitHub
  2. Import into n8n (in the workflow editor, open the three-dot menu → Import from File)
  3. Add Execute Workflow node in your agent workflow
  4. Configure (fallback model, context threshold, permission rules)

Each component is standalone — use one, two, or all three. Works with any LLM provider. n8n 1.70+.

GitHub: genticai-pro/agentguard (Production reliability for n8n AI agents. Drop-in. Zero dependencies. Battle-tested.)


Question for the community: What failure modes have you hit in production that these don’t cover? Specifically curious about multi-agent patterns, webhook-triggered agents, and anything involving file system tools. Want to add recovery strategies for patterns we haven’t seen yet.

This is the right kind of boring reliability work. I like the pattern of treating retries, context budget, and approval as separate reusable pieces instead of one giant prompt. Not sexy, but it works.