Hi everyone, I’m currently looking for a good way to monitor / “watch” a workflow and automatically restart it if it stops or fails.
What I’m trying to achieve:
- A workflow that runs continuously or processes larger batches
- If it fails or stops unexpectedly, it should be restarted automatically
- Ideally within a few minutes, since speed matters in my case
What I’ve already tried:
- Enabled Retry on Fail and Continue on Error on nodes → helps in some cases, but not reliable enough
- Errors still happen, e.g. API rate limits from external tools, where retries don't really solve it
- Added Wait nodes → not a real solution, just shifts the problem
- Tried using a Schedule node to restart the workflow → but this sometimes causes collisions (the workflow starts twice because the previous run is still active)
So right now I sometimes end up with workflows that stopped silently.
My questions:
- Is there a recommended "watchdog" pattern in n8n for this?
- Are there any nodes or built-in mechanisms that can check whether a workflow is still running and restart it if not?
- How do you usually handle rate-limit-related failures in long-running or critical workflows without causing duplicate runs?
Any best practices, architecture patterns, or real-world examples would be super helpful.
1. When the workflow starts, register the execution in the Data Table as "running".
2. On success, update the row with status "success".
3. On error, have the error handler update the Data Table to record that the last execution ended with an error.
This lets you run a very frequent trigger (e.g., every 1 minute) that checks whether the previous execution is running, succeeded, or failed. From there, you can decide whether to start a new execution or wait for the next check.
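As a rough sketch (TypeScript, with a hypothetical row shape for the Data Table — the field names here are illustrative, not an n8n API), the supervisor's decision logic could look like this. Having the worker periodically refresh a heartbeat timestamp is what lets you tell "still running" apart from "silently stopped":

```typescript
// Hypothetical shape of the Data Table row written by the worker workflow.
interface ExecutionRow {
  status: "running" | "success" | "error";
  updatedAt: number; // epoch ms of the last heartbeat/status update
}

type Decision = "start" | "skip" | "restart";

// Supervisor logic, run on a frequent Schedule Trigger (e.g. every minute).
// `staleAfterMs` treats a "running" row with no recent heartbeat as crashed.
function decide(row: ExecutionRow | null, now: number, staleAfterMs: number): Decision {
  if (row === null) return "start";             // nothing recorded yet
  if (row.status === "error") return "restart"; // last run failed
  if (row.status === "success") return "start"; // previous run finished cleanly
  // status === "running": restart only if the heartbeat went stale,
  // otherwise skip to avoid launching a duplicate run.
  return now - row.updatedAt > staleAfterMs ? "restart" : "skip";
}
```

The key design point is the `skip` branch: it is what prevents the collision problem you saw with the plain Schedule restart.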
Great question! This is a very real problem once workflows move beyond simple, short-lived runs.
I work a lot with long-running and batch-heavy n8n workflows, and there are solid watchdog-style patterns that avoid silent failures and duplicate runs.
I can help you design a setup that covers:
- A lightweight "heartbeat" / execution-health check (so stopped workflows are detected reliably)
- Safe restart logic that won't trigger parallel or duplicate executions
- Separation of worker workflows and supervisor workflows, so failures are observable and recoverable
- Idempotency patterns, so retries don't cause double-processing
There’s no single built-in “watchdog node,” but combining executions data, external state (DB/Redis), and a supervisor workflow is the approach that’s been most reliable for me in production.
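For the external-state part, a TTL-based lock is the usual way to guarantee single execution: the supervisor only starts the worker if it can acquire the lock, and a lock held by a crashed run expires on its own. This sketch mimics the semantics of Redis `SET <key> <value> NX PX <ttl>`, with a `Map` standing in for Redis so it stays self-contained:

```typescript
// Lock name -> expiry time (epoch ms). A stale lock (holder crashed)
// expires automatically, so no manual cleanup is needed before restart.
const locks = new Map<string, number>();

// Try to take the lock. Returns false if it is held and still fresh;
// acquires (or takes over an expired lock) otherwise.
function tryAcquire(name: string, now: number, ttlMs: number): boolean {
  const expiry = locks.get(name);
  if (expiry !== undefined && expiry > now) return false; // held and fresh
  locks.set(name, now + ttlMs);
  return true;
}
```

A long-running worker should periodically extend its own lock (the same way it refreshes its heartbeat), so the TTL can stay short without a healthy run losing its lock.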
If you want, share a bit about how your workflow is triggered (webhook, queue, cron, manual, etc.) or how long it typically runs, and I’m happy to walk through a concrete architecture that fits your case.