Workflow re-executes with the same ID when a Wait node exceeds 64 seconds, despite QUEUE_WORKER_LOCK_DURATION set to 120s or QUEUE_HEALTH_CHECK_ACTIVE set to false

Describe the problem/error/question

We’re having trouble with a workflow in queue mode that includes a Wait node. Whenever the wait duration is 1 minute 5 seconds or greater, the workflow starts re-executing from the beginning, using the same execution ID. The previous execution continues for a short while and then stops. The same thing happens with any node that processes for too long, with the possible exception of query nodes.

What is the error message (if any)?

There are no error messages.

Please share your workflow

Sorry, I’m not sure I’m allowed to share the details, but it’s reproducible with a Schedule Trigger, into an Edit Fields node, into any Code node (to log the execution ID for debugging, etc.), and then into a Wait node set to 65 seconds or more.
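For reference, the debugging Code node is roughly along these lines (a minimal sketch; the log prefix is illustrative):

// Code node (JavaScript, "Run Once for All Items")
// Logs the current execution and workflow IDs with a timestamp, then passes the items through unchanged.
console.log(`[repro] execution=${$execution.id} workflow=${$workflow.id} at=${new Date().toISOString()}`);
return $input.all();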

Information on your n8n setup

  • n8n version: 1.91
  • Database (default: SQLite): Postgres (Aurora)
  • n8n EXECUTIONS_PROCESS setting (default: own, main): queue
  • Running n8n via (Docker, npm, n8n cloud, desktop app): AWS
  • Operating system: Linux

Redis v8.0 on Linux

We have tried disabling health checks and increasing the worker lock duration, similar to the advice here: https://community.n8n.io/t/workflow-picked-up-multiple-times-by-multiple-workers/33581

The CPU and memory load are consistently very low. The AWS tasks are not crashing or restarting.

Here is some additional information about the Environment Variables:
// Main service only
// 8 vCPUs (8 × 1024)
// 16 GB RAM (16 × 1024)
N8N_RUNNERS_ENABLED: "true"
N8N_RUNNERS_MODE: "internal"
OFFLOAD_MANUAL_EXECUTIONS_TO_WORKERS: "true"

// Worker and main service
GENERIC_TIMEZONE: "Australia/Brisbane"
TZ: "Australia/Brisbane"
DB_POSTGRESDB_HOST: <hostname>
DB_POSTGRESDB_USER: <username>
DB_POSTGRESDB_PASSWORD: <password>
DB_TYPE: "postgresdb"
EXECUTIONS_MODE: "queue"
QUEUE_HEALTH_CHECK_ACTIVE: "true"
QUEUE_BULL_REDIS_HOST: "redis.n8n"
QUEUE_BULL_REDIS_PORT: "<redacted number>"
QUEUE_WORKER_LOCK_DURATION: "120000"
QUEUE_HEALTH_CHECK_ACTIVE: "false"
N8N_LOG_LEVEL: "debug"

When running, we log $execution.id and $workflow.id; these come through the same each time, and the workflow re-executes indefinitely. The superseded execution sometimes progresses a few nodes further if more nodes are connected, but it never gets far and always stops before completing the remaining work.

We have four worker tasks and one main task.

I have 5 minutes of the tasks’ logs filtered to the repeating execution ID 8140.

main.txt

workers.txt

What is causing the re-execution? It will create extra work for us if we have to guarantee every execution is idempotent. How can we prevent it from happening? What are some further troubleshooting steps we could take?

  1. Don’t use Wait longer than 60 seconds in queued workflows. Replace it with an external Schedule or a loop with separate nodes.

  2. A more robust technical alternative: use a Set node to mark the status and terminate the workflow, then launch a separate execution with an Execute Workflow node that performs the 65-second wait.

  3. It may be worth confirming whether this is a bug or a known limitation, and reporting the behavior on GitHub if it’s not documented.

  4. As an immediate workaround, try lowering the Wait to less than 60 seconds and see if it stabilizes.

Thanks, Erick. Your suggestion to replace it with an external Schedule or a loop with separate nodes reminded me that I hadn’t added a simple 1-second Wait into a looping test. We wouldn’t usually need to use a Wait node for 65+ seconds like that; it was only used to reproduce the re-execution easily. But the limitation is reasonable and helpful to understand.

We have an Edit Fields (Set) node with 350 fields to set (an Oracle requirement). When the number of input items into that large Edit Fields (Set) node is high enough (9,000 will suffice), the node processes for long enough to trigger a re-execution. The same thing occurs when the input is batched into lots of 3,000 using the Loop Over Items node. However, I’ve found an easy solution: add a 1-second Wait node in series, between the 3,000-item batch and the Edit Fields (Set) node. It’s just a bit ugly, and it requires testing (or knowing in advance) which nodes need this treatment, as well as watching carefully for re-execution during testing. I’m guessing it will affect any node that can occupy the processor for 65+ seconds, including cases where switching between nodes immediately re-enters a loop, except nodes we expect to use concurrency-aware code, such as ‘await’.
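Where the long-running step is a Code node we control, I suspect the same effect can be avoided inside the node itself by yielding to the event loop between chunks, assuming the cause really is the process being blocked for too long. A rough sketch (the chunk size and per-item mapping are placeholders):

// Code node (JavaScript, "Run Once for All Items")
// Processes items in bounded chunks and yields to the event loop between chunks,
// so the process is never blocked for 65+ seconds in one stretch.
const items = $input.all();
const CHUNK_SIZE = 1000; // placeholder value
const out = [];
for (let i = 0; i < items.length; i += CHUNK_SIZE) {
  for (const item of items.slice(i, i + CHUNK_SIZE)) {
    out.push({ json: { ...item.json } }); // placeholder per-item mapping
  }
  await new Promise((resolve) => setTimeout(resolve, 0)); // yield control between chunks
}
return out;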

Additionally, I found that the easy solution above breaks if it’s fed enough items (59,000 items, 20 loops). The loops appear to run correctly, but at the finalisation of the last loop it gets stuck processing and won’t exit the Loop Over Items node. It maxes out the CPU and pushes memory to about 80%, then both normalise to around 15-30%, but it appears to be doing nothing. It eventually errors out after an hour. I’m not sure whether increasing the worker resources would solve this, but we typically wouldn’t design integrations to process such a large number of items at once.

If designed with this limitation in mind, we might have refactored it to do something more sensible, such as creating the file with headers and writing batches of lines per execution, making the rest of the integration more resilient and able to handle higher loads. However, this solution should continue to work in the meantime until that optimisation is necessary.

Once work deadlines have passed, I’ll look into submitting a bug report on GitHub so that, if contributors are interested in it, they can confirm the limitation or suggest a change. New integrators could then refer to the documentation and make the necessary adjustments.

Perhaps this behaviour could also be adjusted if it’s related to the execution’s active status and queueing, but that’s beyond what I have the resources to investigate on my own at the moment.

Thank you again for your help; it’s very helpful to learn more about the functionality.

Avoid Wait > 60 seconds in workflows running in queue mode.

Use alternatives such as:
  • A Schedule Trigger to resume.
  • An Execute Workflow node to split the logic.
  • A Set node as a control flag, with a controlled restart.

Divide large processes into smaller blocks.
  • If you have a lot of items (like the 9,000 with Edit Fields), try splitting them into smaller sub-batches (see the sketch below).
  • Use controlled loops with short pauses (Wait 1s) between batches.
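For example, the sub-batching can also be done in a Code node ahead of the heavy processing; a rough sketch (the batch size is only an example):

// Code node (JavaScript) placed before the heavy downstream nodes.
// Groups the incoming items into bounded sub-batches, emitting one item per batch,
// so each downstream pass only handles a limited number of rows.
const BATCH_SIZE = 1000; // example value, tune to your data
const items = $input.all();
const batches = [];
for (let i = 0; i < items.length; i += BATCH_SIZE) {
  batches.push({
    json: { rows: items.slice(i, i + BATCH_SIZE).map((item) => item.json) },
  });
}
return batches;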

Monitor with N8N_LOG_LEVEL=debug.
  • Check whether the worker actually loses the lock or freezes (CPU/RAM at 100%).
  • Observe whether a second worker picks up the execution ID after 65s.

Consider regular mode if the processes are very heavy.
Queue mode is ideal for scalability, but not always suitable for workflows with long blocking nodes.


Marked the above as the solution. Great work, and thank you! People will be thankful you’re in their community. :smile:
