Workflow picked up multiple times by multiple workers

Describe the problem/error/question

We have a workflow in production that runs on a schedule; it is a pipeline that executes many child workflows in a specific order.

When the schedule triggers, the execution is picked up by a worker and all is well, but after some time, apparently always at the same point in the workflow, the main pipeline is picked up again by the second node, with the same execution ID.

As a result, the main pipeline runs twice, in parallel, on two different nodes, with the same execution ID =D

I have spent a few days trying to understand the problem, but I cannot find a solution. There are no loops in the main pipeline or in the sub-workflows, and once again I insist: the job is picked up by the second node with the same job ID and execution ID as the first one, which is unsettling.

Here is an extract of the logs I have; please ask if more is required:

2023-11-29T07:00:00.009Z | debug    | Received trigger for workflow "xxxxx" "{ file: 'ActiveWorkflowRunner.js', function: 'returnFunctions.emit' }"
2023-11-29T07:00:00.291Z | info     | Start job: 397 (Workflow ID: 1vIL7sBPUvj28mTx | Execution: 7261) "{ file: 'worker.js', function: 'runJob' }" // running on boondn8n-boond-n8n-worker-5f7d5f7f7c-rfvrk
2023-11-29T07:00:00.879Z | verbose  | Workflow execution started "{\n  workflowId: '1vIL7sBPUvj28mTx',\n  file: 'LoggerProxy.js',\n  function: 'exports.verbose'\n}"
2023-11-29T07:00:00.887Z | debug    | Start processing node "Schedule Trigger" "{\n  node: 'Schedule Trigger',\n  workflowId: '1vIL7sBPUvj28mTx',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-11-29T07:00:00.888Z | debug    | Running node "Schedule Trigger" started "{\n  node: 'Schedule Trigger',\n  workflowId: '1vIL7sBPUvj28mTx',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-11-29T07:00:00.889Z | debug    | Running node "Schedule Trigger" finished successfully "{\n  node: 'Schedule Trigger',\n  workflowId: '1vIL7sBPUvj28mTx',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-11-29T07:00:00.891Z | debug    | Save execution progress to database for execution ID 7261  "{\n  executionId: '7261',\n  nodeName: 'Schedule Trigger',\n  file: 'WorkflowExecuteAdditionalData.js',\n  function: 'nodeExecuteAfter'\n}"
.....
2023-11-29T07:00:55.511Z | verbose  | Workflow execution finished successfully "{\n  workflowId: 'hXCPn6Y8HKAfs5Ud',\n  file: 'LoggerProxy.js',\n  function: 'exports.verbose'\n}"
2023-11-29T07:00:55.511Z | debug    | Executing hook (hookFunctionsSave) "{\n  executionId: '7265',\n  workflowId: 'hXCPn6Y8HKAfs5Ud',\n  file: 'WorkflowExecuteAdditionalData.js',\n  function: 'workflowExecuteAfter'\n}"
2023-11-29T07:00:55.512Z | debug    | Save execution data to database for execution ID 7265 "{\n  executionId: '7265',\n  workflowId: 'hXCPn6Y8HKAfs5Ud',\n  finished: true,\n  stoppedAt: 2023-11-29T07:00:55.510Z,\n  file: 'sharedHookFunctions.js',\n  function: 'updateExistingExecution'\n}"
2023-11-29T07:01:11.070Z | info     | Start job: 397 (Workflow ID: 1vIL7sBPUvj28mTx | Execution: 7261) "{ file: 'worker.js', function: 'runJob' }" // running on boondn8n-boond-n8n-worker-5f7d5f7f7c-9plz6
2023-11-29T07:01:11.095Z | verbose  | Workflow execution started "{\n  workflowId: '1vIL7sBPUvj28mTx',\n  file: 'LoggerProxy.js',\n  function: 'exports.verbose'\n}"

The duplicate execution always seems to start at the same point in the logs. I checked for a few days and it always starts at the same point in the workflow (though the time can vary).

Please share your workflow

I cannot share the exact workflow, but it is a very simple workflow that calls many sub-workflows. I checked, and no, the workflow does not call itself; even if it did, I believe the execution ID would be different, and from what I have read it would run on the same worker as the first execution.

Information on your n8n setup

  • n8n version: 1.17.1 (freshly updated from this morning)
  • Database (default: SQLite): PostgreSQL
  • n8n EXECUTIONS_PROCESS setting (default: own, main): main, in queue mode, with Bull connected to a single Redis instance
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker, in a k8s cluster with this helm chart
  • Operating system: Linux x64 (AWS EKS worker nodes)

We have one main n8n instance and two worker nodes.

I hope my problem is clear enough and that someone can help.

Thank you in advance.

Hey @Webvoid,

Welcome to the community :raised_hands:

We might need to summon @krynble for this one, but the first thing I would do is make sure that you only have one main instance running and that the workers are correctly started as workers rather than as main instances.

Thanks for the welcome =)

I confirm that we have a single main instance running the default command of the Docker image, and two worker nodes running n8n worker --concurrency=2
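For reference, the setup described above could be reproduced with something like the following. This is a hedged sketch, not our exact manifests: the Redis host name is a placeholder, and the environment variables are the ones n8n documents for queue mode.

```shell
# Main instance: runs the scheduler/triggers and enqueues jobs in Redis.
export EXECUTIONS_MODE=queue              # enable queue (Bull) mode
export QUEUE_BULL_REDIS_HOST=redis        # placeholder: single Redis instance
export DB_TYPE=postgresdb                 # PostgreSQL instead of SQLite
n8n start

# Worker instances (two replicas), each processing up to 2 jobs at a time:
n8n worker --concurrency=2
```

In our Kubernetes cluster the same variables are set in the container `env` sections by the Helm chart.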

main node

2023-11-29T06:44:22.358Z | info     | Initializing n8n process "{ file: 'start.js', function: 'init' }"
2023-11-29T06:44:26.135Z | debug    | Initializing event bus... "{ file: 'MessageEventBus.js', function: 'initialize' }"
2023-11-29T06:44:26.138Z | debug    | Initializing event writer "{ file: 'MessageEventBus.js', function: 'initialize' }"

workers

2023-11-29T06:45:16.585Z | debug    | Starting n8n worker... "{ file: 'worker.js', function: 'init' }"
2023-11-29T06:45:17.326Z | debug    | Initializing event bus... "{ file: 'MessageEventBus.js', function: 'initialize' }"
2023-11-29T06:45:17.328Z | debug    | Initializing event writer "{ file: 'MessageEventBus.js', function: 'initialize' }"
2023-11-29T06:45:18.676Z | debug    | Starting n8n worker... "{ file: 'worker.js', function: 'init' }"
2023-11-29T06:45:19.290Z | debug    | Initializing event bus... "{ file: 'MessageEventBus.js', function: 'initialize' }"
2023-11-29T06:45:19.292Z | debug    | Initializing event writer "{ file: 'MessageEventBus.js', function: 'initialize' }"

Hey @Webvoid,

Can you output the list of containers running in your k8s cluster?

Indeed, but personally I only have access to ArgoCD, of which I have attached a screen capture.

If you require more information, I can still get it, though with a little more effort ^^

Hey @Webvoid,

That looks OK to me. We will need to wait for @krynble to see if he has any thoughts.

Hello!

Indeed I have a good idea of what might be happening here.

Especially since you say this always happens at the same point, it is very likely that the first worker is performing a CPU-intensive operation at that moment.

When this happens, n8n is unable to perform the "heartbeat" that indicates the worker is healthy and still processing the execution. When the heartbeat fails, another worker picks the job up and tries to continue the work, effectively duplicating it.

TL;DR: You can increase the lock duration (default 30 s, specified as 30000 ms) via the QUEUE_WORKER_LOCK_DURATION environment variable, giving the worker more time between heartbeats. The value is set in milliseconds, so I'd suggest trying 60000 first and then, if needed, increasing it to 90000.

The drawback is that if latency matters, for example when workflows respond to HTTP requests, and a worker genuinely fails, it will take much longer for the next worker to pick up the work.

You can also change QUEUE_WORKER_LOCK_RENEW_TIME (default 15000), which controls how frequently the worker performs the heartbeat.

Please try those changes and let us know if this solves the problem.

Thanks, I will try that.

This topic was automatically closed 7 hours after the last reply. New replies are no longer allowed.