What it does: When any monitored workflow fails, Supervisor catches the error, classifies its severity, and sends a formatted Telegram alert — with the execution link, retry guidance, and circuit breaker status. It also runs a heartbeat monitor every 5 minutes and pings Healthchecks.io as a dead man’s switch.
Three safety layers: Telegram alert → cache if Telegram fails → Healthchecks catches silence if everything fails.
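To make the fallback order concrete, here is a minimal sketch of the first two layers. It is not the actual workflow logic: `deliver_alert`, `send_telegram`, and the in-memory `cache` list are hypothetical names chosen for illustration. The third layer lives outside this code, since Healthchecks.io raises the alarm on its own when the pings stop.

```python
from typing import Callable, List


def deliver_alert(
    message: str,
    send_telegram: Callable[[str], None],  # hypothetical sender; raises on failure
    cache: List[str],                      # hypothetical stand-in for the alert cache
) -> str:
    """Layer 1: try Telegram. Layer 2: cache the alert if Telegram fails."""
    try:
        send_telegram(message)
        return "sent"
    except Exception:
        # Telegram is unreachable, so keep the alert for later delivery.
        cache.append(message)
        return "cached"


def broken_sender(message: str) -> None:
    """Simulates a Telegram outage."""
    raise ConnectionError("Telegram unreachable")


pending: List[str] = []
status = deliver_alert("Workflow X failed", broken_sender, pending)
```

With a failing sender the alert ends up in `pending` rather than being lost, which is the whole point of the second layer.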
Workflow 2 — Heartbeat Monitor: Circuit breaker auto-recovery, error rate trends, Healthchecks.io ping, orphan execution detection via n8n API
Workflow 3 — Data Retention: Automated cleanup every 12 hours
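The post does not spell out how the circuit breaker auto-recovery works, so the following is only a rough sketch of one common approach: the breaker opens after a number of failures and a periodic heartbeat closes it again once a cooldown has passed. The class name, the threshold, and the cooldown are all assumptions, not values taken from the workflow.

```python
from typing import Optional


class CircuitBreaker:
    """Opens after repeated failures; a periodic heartbeat closes it after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 300.0):
        # Both defaults are illustrative assumptions.
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self.opened_at is not None

    def record_failure(self, now: float) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold and self.opened_at is None:
            self.opened_at = now

    def heartbeat(self, now: float) -> bool:
        """Called on each heartbeat run; returns True if the breaker was just closed."""
        if self.opened_at is not None and now - self.opened_at >= self.cooldown_seconds:
            self.opened_at = None
            self.failures = 0
            return True
        return False
```

Timestamps are passed in explicitly so the logic can be checked without waiting for real time to pass.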
The backstory: This is actually a rebuild. The original version (V23) was a 139-node, 8-workflow, AI-powered system that used DeepSeek to triage errors, search historical fixes, and suggest repairs. It was impressive engineering — and it cost ~500,000 tokens per error just to send a Telegram message that said “this workflow failed on this node with this error.”
The same information was already in the Error Trigger payload before any AI processing.
So I stripped it down to what actually mattered: catch the error, tell me, and make sure silence itself is an alarm.
The circuit breaker + deduplication combination in Workflow 1 is exactly the kind of thing that makes error monitoring actually usable in production. Without deduplication you’d drown in duplicate alerts during a cascade failure. The rebuild from 139 nodes (AI-powered) to 46 nodes (zero AI) is also a good call: simpler systems are easier to trust when you’re debugging the monitoring system itself.
@perfectc
Wow, that’s really cool. I haven’t found another ready-made solution like this on the market. Did you ever think about using ping with a timestamp to not rely solely on the platform’s log?
Thanks! Yeah, the dedup was born from real pain: one Postgres timeout would fire 4-5 identical alerts in seconds. You start ignoring them, and that defeats the whole point. And honestly, the rebuild was humbling. Hard to justify 500K tokens when the Error Trigger payload already has everything you need. Plus you really don’t want to be debugging your debugging system.
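For anyone wondering what that kind of dedup can look like, here is a minimal sketch under assumptions: alerts are fingerprinted by workflow, node, and error message, and repeats within a time window are suppressed. The `Deduplicator` class, the fingerprint fields, and the 60-second window are illustrative, not the values used in the actual workflow.

```python
import hashlib
import time
from typing import Dict, Optional


class Deduplicator:
    """Suppresses identical alerts seen again within a time window."""

    def __init__(self, window_seconds: float = 60.0):  # window length is an assumption
        self.window_seconds = window_seconds
        self._last_alerted: Dict[str, float] = {}

    def should_alert(
        self, workflow: str, node: str, error: str, now: Optional[float] = None
    ) -> bool:
        now = time.time() if now is None else now
        # Fingerprint the alert so identical errors map to the same key.
        key = hashlib.sha256(f"{workflow}|{node}|{error}".encode()).hexdigest()
        last = self._last_alerted.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: stay quiet
        self._last_alerted[key] = now
        return True


dedup = Deduplicator(window_seconds=60.0)
first = dedup.should_alert("orders", "Postgres", "timeout", now=0.0)
repeat = dedup.should_alert("orders", "Postgres", "timeout", now=2.0)
```

Here `first` is `True` and `repeat` is `False`, so a burst of identical Postgres timeouts produces one alert instead of four or five.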
Thank you! Right now the heartbeat workflow pings Healthchecks.io every 5 minutes, so if n8n itself goes down or the workflow silently stops, Healthchecks catches the silence and alerts me. That’s basically the “ping with timestamp” idea but outsourced to a free external service. I did consider logging timestamps locally in Postgres as a backup, but figured if n8n is down, it can’t write to the DB either. So having an external service watching for silence felt more reliable. If Healthchecks stops hearing from me, something is wrong, guaranteed.
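Outside of n8n, the same dead man’s switch ping can be sketched in a few lines. This assumes the standard Healthchecks.io ping URL format (`https://hc-ping.com/<uuid>`, with `/fail` appended to report a failure). The UUID is a placeholder, and the `opener` parameter is a hypothetical hook that only exists so the function can be exercised without network access.

```python
import urllib.request
from typing import Any, Callable

PING_BASE = "https://hc-ping.com"


def build_ping_url(check_uuid: str, failed: bool = False) -> str:
    """Build the ping URL; appending /fail reports a failure explicitly."""
    url = f"{PING_BASE}/{check_uuid}"
    return f"{url}/fail" if failed else url


def ping(
    check_uuid: str,
    failed: bool = False,
    opener: Callable[..., Any] = urllib.request.urlopen,  # injectable for testing
    timeout: float = 10.0,
) -> bool:
    """Send one heartbeat; returns False instead of raising on network errors."""
    try:
        with opener(build_ping_url(check_uuid, failed), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        return False
```

Scheduled every 5 minutes, a missed run simply means no ping arrives, and the external service treats that silence as the alarm.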