Restart error workflows

Hi everyone, I’m currently looking for a good way to monitor / “watch” a workflow and automatically restart it if it stops or fails.

What I’m trying to achieve:

  • A workflow that runs continuously or processes large batches

  • If it fails or stops unexpectedly, it should be restarted automatically

  • Ideally within a few minutes, since speed matters in my case

What I’ve already tried:

  • Enabled Retry on Fail and Continue on Error on nodes → helps in some cases, but not reliable enough

  • Errors still happen, e.g. API rate limits from external tools, which retries don’t really solve

  • Added Wait nodes → not a real solution, just shifts the problem

  • Tried using a Schedule node to restart the workflow → but this sometimes causes collisions (workflow starts twice because the previous run is still active)

So right now I sometimes end up with workflows that have stopped silently.

My questions:

  • Is there a recommended “watchdog” pattern in n8n for this?

  • Are there any nodes or built-in mechanisms that can check whether a workflow is still running and restart it if not?

  • How do you usually handle rate-limit-related failures in long-running or critical workflows without causing duplicate runs?

Any best practices, architecture patterns, or real-world examples would be super helpful.

Thanks a lot in advance! :raising_hands:

You can use n8n data tables for this.

  1. Create a table that acts like a status board for the workflow.

  2. Create an error handler workflow specifically for this workflow and enable it in your workflow settings.

  3. When the workflow starts, register the execution in the Data Table as “running”.

  4. If successful, update the row with status “success”.

  5. On error, the error handler should update the Data Table row to signal that the last execution ended with an error.

This allows you to have a very frequent trigger in your workflow (e.g., every 1 minute) that checks whether the previous execution is running, successful, or failed. From there, you can decide whether to continue executing or wait for the next check.
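As an illustration, the per-tick decision could look something like this in a Code node. The row shape, the `decide` helper, and the 10-minute staleness threshold are assumptions for the sketch, not n8n built-ins — you’d read the row from your Data Table and act on the returned verdict:

```javascript
// Decide what a 1-minute watchdog tick should do, given the last
// status-board row for this workflow. Row shape is hypothetical:
// { status: 'running' | 'success' | 'error', updatedAt: ISO string }
const STALE_MS = 10 * 60 * 1000; // assumption: >10 min with no update = stalled

function decide(row, now = Date.now()) {
  if (!row) return 'start';                      // never ran: start it
  const age = now - new Date(row.updatedAt).getTime();
  if (row.status === 'running') {
    return age > STALE_MS ? 'restart' : 'skip';  // 'skip' avoids duplicate runs
  }
  if (row.status === 'error') return 'restart';  // error handler flagged it
  return 'skip';                                 // last run succeeded
}

// Example: a run still marked "running" 15 minutes ago counts as stalled.
const fifteenMinAgo = new Date(Date.now() - 15 * 60 * 1000).toISOString();
console.log(decide({ status: 'running', updatedAt: fifteenMinAgo })); // restart
```

The key point is that the trigger itself never blindly starts the workflow — it only acts when the status board says the previous run is finished or stalled.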

Makes sense?


Great question — this is a very real problem once workflows move beyond simple, short-lived runs.

I work a lot with long-running and batch-heavy n8n workflows, and there are solid watchdog-style patterns that avoid silent failures and duplicate runs.

I can help you design a setup that covers:

  • A lightweight “heartbeat” / execution-health check (so stopped workflows are detected reliably)

  • Safe restart logic that won’t trigger parallel or duplicate executions

  • Rate-limit aware backoff patterns (token buckets, execution locks, cooldown windows)

  • Separation of worker workflows and supervisor workflows so failures are observable and recoverable

  • Idempotency patterns so retries don’t cause double-processing

There’s no single built-in “watchdog node,” but combining executions data, external state (DB/Redis), and a supervisor workflow is the approach that’s been most reliable for me in production.
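To make the token-bucket idea concrete, here’s a minimal sketch. The class and its numbers are illustrative, not anything n8n ships — in practice you’d keep the bucket state in your external store (DB/Redis) so it survives restarts:

```javascript
// Minimal token bucket: allow bursts up to `capacity` calls, refilled at
// `refillPerSec` tokens per second. Before each API call, call take();
// if it returns false, back off instead of hitting the rate limit.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.last = Date.now();
  }
  take(now = Date.now()) {
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: a burst of 5 calls succeeds, the 6th is throttled until refill.
const bucket = new TokenBucket(5, 1); // 5-call burst, ~1 call/sec sustained
const results = Array.from({ length: 6 }, () => bucket.take());
console.log(results); // [true, true, true, true, true, false]
```

Combined with an execution lock in the same store, this gives you both rate-limit safety and a guarantee against parallel runs.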

If you want, share a bit about how your workflow is triggered (webhook, queue, cron, manual, etc.) or how long it typically runs, and I’m happy to walk through a concrete architecture that fits your case.