🛡️ Automated Error Monitoring for n8n (3 workflows, 46 nodes, zero AI)

Hey everyone 👋

I built an error monitoring system for my self-hosted n8n instance and wanted to share it.

What it does: When any monitored workflow fails, Supervisor catches the error, classifies its severity, and sends a formatted Telegram alert — with the execution link, retry guidance, and circuit breaker status. It also runs a heartbeat monitor every 5 minutes and pings Healthchecks.io as a dead man’s switch.

Three safety layers: Telegram alert → cache if Telegram fails → Healthchecks catches silence if everything fails.
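To make the layering concrete, here is a minimal Python sketch of the first two layers. It is not taken from the repo: `deliver_alert`, `send_telegram`, and the in-memory `cache` list are hypothetical names, and the real workflow presumably caches to Postgres rather than a list. The third layer (Healthchecks noticing silence) lives outside n8n, so it does not appear in the code.

```python
from typing import Callable, List


def deliver_alert(
    message: str,
    send_telegram: Callable[[str], None],  # hypothetical sender, raises on failure
    cache: List[str],                      # stand-in for a persistent cache table
) -> str:
    """Layer 1: try Telegram. Layer 2: cache the alert if Telegram fails."""
    try:
        send_telegram(message)
        return "sent"
    except Exception:
        # Telegram is unreachable or rejected the message: keep the alert
        # so it can be re-sent later instead of being lost.
        cache.append(message)
        return "cached"
```

If both layers fail (for example because n8n itself is down), nothing in this function runs at all, which is exactly the case the external dead man's switch is there to cover.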

Stack: n8n + Neon PostgreSQL (free tier) + Telegram Bot + Healthchecks.io

Architecture:

| Version | Workflows | Nodes | DB Tables | AI Tokens/Error | Monthly Cost |
| --- | --- | --- | --- | --- | --- |
| V1 Lean | 3 | 46 | 3 | 0 | $0 |

  • Workflow 1 — Supervisor Core: Error intake, severity classification, credential sanitization, deduplication, circuit breaker, Telegram alerts

  • Workflow 2 — Heartbeat Monitor: Circuit breaker auto-recovery, error rate trends, Healthchecks.io ping, orphan execution detection via n8n API

  • Workflow 3 — Data Retention: Automated cleanup every 12 hours
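For readers wondering what "credential sanitization" and "severity classification" can look like without AI, here is a rough Python sketch. The regex patterns, keyword lists, and severity names are illustrative assumptions, not the rules used in the actual workflow JSON.

```python
import re

# Assumed patterns: key=value style secrets and user:password@ in connection URLs.
KEY_VALUE = re.compile(r"(?i)\b(password|token|api[_-]?key|secret)\b(\s*[=:]\s*)[^\s,;]+")
URL_CREDS = re.compile(r"(://[^:/\s]+:)[^@\s]+(@)")

# Assumed keyword rules, checked in order from most to least severe.
SEVERITY_RULES = [
    ("critical", ("econnrefused", "out of memory", "database")),
    ("high", ("timeout", "timed out", "429", "rate limit")),
]


def sanitize(text: str) -> str:
    """Mask anything that looks like a credential before it reaches an alert."""
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return URL_CREDS.sub(r"\1[REDACTED]\2", text)


def classify(message: str) -> str:
    """Return the first severity whose keywords appear in the message."""
    lowered = message.lower()
    for severity, needles in SEVERITY_RULES:
        if any(needle in lowered for needle in needles):
            return severity
    return "low"
```

Plain keyword matching like this is crude, but it is deterministic and costs zero tokens, which fits the "lean" goal of the rebuild.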

The backstory: This is actually a rebuild. The original version (V23) was a 139-node, 8-workflow, AI-powered system that used DeepSeek to triage errors, search historical fixes, and suggest repairs. It was impressive engineering — and it cost ~500,000 tokens per error just to send a Telegram message that said “this workflow failed on this node with this error.”

The same information was already in the Error Trigger payload before any AI processing.

So I stripped it down to what actually mattered: catch the error, tell me, and make sure silence itself is an alarm.

GitHub (open source): 👉 chanrylejay/supervisor-error-handling-workflow (n8n automation that handles my error handling)

Includes full workflow JSON exports, database schema, and setup documentation.

Happy to answer any questions. If you’re running self-hosted n8n and want error monitoring without paying for external services, this might help. 🙂

The circuit breaker + deduplication combination in Workflow 1 is exactly the kind of thing that makes error monitoring usable in production: without deduplication you’d drown in duplicate alerts during a cascade failure. The rebuild from 139 nodes (AI-powered) to 46 nodes (zero AI) is also a good call; simpler systems are easier to trust when you’re debugging the monitoring system itself.
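For anyone unfamiliar with the circuit breaker pattern mentioned here, a minimal Python sketch follows. It assumes the breaker opens after a number of failures and closes again after a cooldown (the "auto-recovery" part); the class name, threshold, and cooldown values are made up for illustration and may not match how Supervisor does it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5        # assumed: failures before the breaker opens
    cooldown_seconds: float = 900.0   # assumed: how long it stays open
    failures: int = 0
    opened_at: Optional[float] = None

    def record_failure(self, now: float) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold and self.opened_at is None:
            self.opened_at = now

    def is_open(self, now: float) -> bool:
        """True while alerts should be suppressed; resets after the cooldown."""
        if self.opened_at is None:
            return False
        if now - self.opened_at >= self.cooldown_seconds:
            # Auto-recovery: close the breaker and start counting from zero.
            self.opened_at = None
            self.failures = 0
            return False
        return True
```

Passing `now` in explicitly (instead of calling the clock inside the class) keeps the logic easy to test, which matters when the thing under test is the monitoring system itself.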

@perfectc
Wow, that’s really cool. I haven’t found another ready-made solution like this on the market. Did you ever think about using ping with a timestamp to not rely solely on the platform’s log?

Thanks! Yeah, the dedup was born from real pain: one Postgres timeout would fire 4-5 identical alerts in seconds. You start ignoring them, and that defeats the whole point. And honestly, the rebuild was humbling. It’s hard to justify 500K tokens when the Error Trigger payload already has everything you need. Plus, you really don’t want to be debugging your debugging system.

Appreciate the feedback!
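A sketch of the dedup idea, in case it helps someone build their own: fingerprint each error and only alert once per fingerprint within a time window. This is a simplified in-memory Python version under assumed names (`fingerprint`, `Deduplicator`, a 60-second window); the real workflow stores its state in Postgres and may key errors differently.

```python
import hashlib
from typing import Dict


def fingerprint(workflow_id: str, node: str, message: str) -> str:
    """Stable key for 'the same error': same workflow, node, and message."""
    raw = f"{workflow_id}|{node}|{message}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Deduplicator:
    def __init__(self, window_seconds: float = 60.0) -> None:  # assumed window
        self.window_seconds = window_seconds
        self._last_alerted: Dict[str, float] = {}

    def should_alert(self, key: str, now: float) -> bool:
        """Alert on the first sighting, then stay quiet until the window passes."""
        last = self._last_alerted.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_alerted[key] = now
        return True
```

With this, a burst of five identical Postgres timeouts in a few seconds produces one alert instead of five, while a different error (different fingerprint) still gets through immediately.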

Thank you! Right now the heartbeat workflow pings Healthchecks.io every 5 minutes, so if n8n itself goes down or the workflow silently stops, Healthchecks catches the silence and alerts me. That’s basically the “ping with timestamp” idea, but outsourced to a free external service. I did consider logging timestamps locally in Postgres as a backup, but figured that if n8n is down, it can’t write to the DB either. So having an external service watching for silence felt more reliable. If Healthchecks stops hearing from me, something is wrong, guaranteed.
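Outside n8n, the same dead man's switch ping can be sketched in a few lines of Python. Healthchecks.io checks are pinged with a plain HTTP request to `https://hc-ping.com/<uuid>`, and appending `/fail` reports an explicit failure. The function names and the injectable `fetch` parameter below are my own choices for illustration; in the workflow itself this is presumably just an HTTP Request node on a 5-minute schedule.

```python
import urllib.request
from typing import Callable

PING_BASE = "https://hc-ping.com"


def build_ping_url(check_uuid: str, healthy: bool = True) -> str:
    """Success ping by default; the /fail suffix signals an explicit failure."""
    url = f"{PING_BASE}/{check_uuid}"
    return url if healthy else f"{url}/fail"


def http_get(url: str) -> int:
    """Perform the real request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status


def send_heartbeat(
    check_uuid: str,
    healthy: bool = True,
    fetch: Callable[[str], int] = http_get,  # injectable so it can be tested offline
) -> bool:
    """Return True if the ping was accepted, False on any network error."""
    try:
        return fetch(build_ping_url(check_uuid, healthy)) == 200
    except OSError:
        # A failed ping is not retried here: a missed ping is itself the signal
        # that the external service will eventually turn into an alert.
        return False
```

The important property is that the alarm condition is the absence of pings, so it still works when the machine doing the pinging is the thing that died.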

Perfect! I hadn’t thought of it that way.
