After upgrading my self-hosted n8n instance to v2.x (in queue mode), I’m noticing that the number of failing executions has increased, with no material changes to my infrastructure.
I’m not sure why this happened, but some fundamental change in the v2.x logic has bumped my execution error rate from roughly 1-2% on v1.x to roughly 5-20% on v2.x.
Can someone from the development team comment on whether this is a known issue and whether it’s being actively investigated?
Have you read through the breaking changes in the documentation to see whether there are any changes you need to make to keep previously working workflows from breaking in version 2?
Also check the migration tool under Settings to see if it is reporting anything you need to action.
If so, can you share one or two workflow errors here so we can see the type of errors you are now getting?
@Wouter_Nigrini , yes, I’ve read through the breaking changes. None of my workflows use sub-workflows, so that’s not the problem.
I’m about 95% confident, though, that this change is the culprit (“Remove QUEUE_WORKER_MAX_STALLED_COUNT”):
Unfortunately, there is nothing I can do to work around this problem, as it appears to be a fundamental change to how n8n performs its queue retry logic.
Almost ALL of the errors look like this:
Error: This execution failed to be processed too many times and will no longer retry. To allow this execution to complete, please break down your workflow or scale up your workers or adjust your worker settings.
    at /usr/local/lib/node_modules/n8n/src/workflow-runner.ts:438:15
    at processTicksAndRejections (node:internal/process/task_queues:105:5)
These workflows normally complete in 5 minutes or less, but since upgrading to n8n v2.x, roughly 20% of the time this error gets thrown at random across almost every workflow.
Hi @Wouter_Nigrini , I use Google Cloud Run to deploy the main node as a service – here’s the corresponding YAML for that (sensitive info was REDACTED):
For the worker nodes, here’s the corresponding YAML. I specifically set concurrency to 1, because I have a sidecar container that acts as an external task runner. I usually spin up between 14 and 30 worker node instances (which worked just fine with n8n v1.x):
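Since the real YAML is redacted, here’s a rough sketch of the shape of that worker service instead, just so the discussion is easier to follow. The image names, Redis host, and scaling bounds are placeholders, and the env vars shown are the standard queue-mode / task-runner ones rather than my exact list:

```yaml
# Rough sketch of a Cloud Run worker service with a task-runner sidecar.
# Image names, Redis host, and scaling bounds are placeholders.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: n8n-worker
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "14"
        autoscaling.knative.dev/maxScale: "30"
    spec:
      containers:
        - name: n8n-worker
          image: REDACTED                        # n8n image
          args: ["worker", "--concurrency=1"]    # one job at a time per worker instance
          env:
            - name: EXECUTIONS_MODE
              value: "queue"
            - name: QUEUE_BULL_REDIS_HOST
              value: "REDACTED"
            - name: N8N_RUNNERS_ENABLED
              value: "true"
            - name: N8N_RUNNERS_MODE
              value: "external"
        - name: task-runner                      # sidecar that executes the heavy jobs
          image: REDACTED                        # external task runner image
```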
I was able to reduce the frequency of these errors by changing these values within my system (an example of how they’re set follows the list):
QUEUE_WORKER_LOCK_DURATION=3600000
QUEUE_WORKER_LOCK_RENEW_TIME=120000
QUEUE_WORKER_STALLED_INTERVAL=240000
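For anyone wiring this up themselves, this is roughly how those three end up looking as env entries on the worker container (sketch only; everything except the variable names and values is boilerplate):

```yaml
# Env entries on the worker container (values are in milliseconds).
env:
  - name: QUEUE_WORKER_LOCK_DURATION
    value: "3600000"    # a job's lock stays valid for up to 1 hour without renewal
  - name: QUEUE_WORKER_LOCK_RENEW_TIME
    value: "120000"     # renew the lock every 2 minutes while the job is running
  - name: QUEUE_WORKER_STALLED_INTERVAL
    value: "240000"     # check for stalled jobs every 4 minutes
```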
However, that doesn’t fully solve the problem – it only reduces how often it happens. I’d really like an n8n developer to chime in on how the long-term issue will get resolved.
Thanks for sharing your insights openly in this discussion. I joined this thread because my team and I are currently experimenting with deploying self-hosted n8n on an AWS ECS cluster in queue mode.
Small workflows worked well, but with heavier workflows we’re also experiencing the `Error: This execution failed to be processed too many times and will no longer retry. To allow this execution to complete, please break down your workflow or scale up your workers or adjust your worker settings.` error.
@Darien_Kindlund Yesterday I tried overriding the defaults of the env vars you mentioned, but with much lower values than yours, and it failed.
I had
{
name = "QUEUE_WORKER_LOCK_DURATION"
value = "180000"
},
{
name = "QUEUE_WORKER_LOCK_RENEW_TIME"
value = "30000"
},
{
name = "QUEUE_WORKER_STALLED_INTERVAL"
value = "60000"
},
Now I used the same values you recommended:
{
name = "QUEUE_WORKER_LOCK_DURATION"
value = "3600000"
},
{
name = "QUEUE_WORKER_LOCK_RENEW_TIME"
value = "120000"
},
{
name = "QUEUE_WORKER_STALLED_INTERVAL"
value = "240000"
},
I’m waiting to give it one more try with these higher values later today.
@Darien_Kindlund I wonder if you managed to pinpoint any other reason for these failures or figured out a definitive fix.
I also found a similar error recently when trying to upload a large file using the Google Drive node. I managed to resolve the issue by adding the following env vars to both my main and worker instances:
Now, this all depends on the available memory on your ECS instances as well as what is going on inside the failing workflow. If the above does not resolve your issue, then I would suggest we look at your workflow next: its size and complexity, as well as the way you’re handling data through it.
Thanks a lot @Wouter_Nigrini, that’s useful too.
So far the test is going well.
It processes a large batch of records in pages. It will take a couple of hours, but so far it is holding up.
^ I think this just controls how much data you can “pin” in an n8n workflow for testing purposes… I don’t think this env variable is used in production.
QUEUE_EXECUTIONS_DATA_MAX_SIZE
^ I don’t see anything indicating that env variable is actually used within the n8n codebase… can you provide a pointer to where it’s actually used?
Hi @Darien_Kindlund I think you’re correct actually. I think then it must have been a combination of
NODE_OPTIONS=--max-old-space-size=4096 and N8N_PAYLOAD_SIZE_MAX=134217728 which fixed it for me. I had the same error whether I ran the workflow manually or on a schedule.
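For anyone who wants to replicate that, here’s a compose-style sketch of setting those two variables on both instances (service and image names are placeholders; the values are the ones above):

```yaml
# Sketch: the same memory/payload settings applied to the main and worker containers.
services:
  n8n-main:
    image: n8nio/n8n:latest
    environment:
      - NODE_OPTIONS=--max-old-space-size=4096   # let the Node.js heap grow to ~4 GB
      - N8N_PAYLOAD_SIZE_MAX=134217728           # raise the maximum payload size n8n accepts
  n8n-worker:
    image: n8nio/n8n:latest
    command: worker
    environment:
      - NODE_OPTIONS=--max-old-space-size=4096
      - N8N_PAYLOAD_SIZE_MAX=134217728
```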
@Wouter_Nigrini , yes, I’m still seeing the issue, but I have simply masked the errors and layered in additional retry logic. It’s annoying, but it works for now.
How do you implement the retry logic? Because for me, when I have this error, I can’t retry the workflow and I can’t see the input data of the workflow in order to retry it.
n8n recommends setting concurrency to 5 or higher for your worker instances. Setting low concurrency values with a large number of workers can exhaust your database’s connection pool, leading to processing delays and failures.
Have you tried increasing the worker concurrency? I also have a sidecar container running for the external task runner with a concurrency of 10 and don’t experience any problems. I don’t think in-process/in-container concurrency should cause an issue - or are you worried about overwhelming the external task runner?
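If you did want to try it, bumping concurrency is just a flag on the worker command; here’s a compose-style sketch (service and image names are placeholders, and the pool-size line only matters if many workers share one Postgres database):

```yaml
# Sketch: higher per-worker concurrency plus a slightly larger DB connection pool.
services:
  n8n-worker:
    image: n8nio/n8n:latest
    command: worker --concurrency=10   # queue-mode docs suggest 5 or higher per worker
    environment:
      - DB_POSTGRESDB_POOL_SIZE=4      # extra headroom when many workers share one database
```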
@all-you-can-pete , I don’t have DB connection pool problems. I specifically want a worker concurrency of 1 due to the workloads I’m running. Each job is intense and requires a dedicated external task runner container. If I run two jobs within the same external task runner container instance, it’s likely the job won’t complete because it’s out of resources.
Fair enough, I was just being hopeful that it was related and that increasing the concurrency would be possible, since the logged warning is a lot more generic (“CAN LEAD TO AN UNSTABLE ENVIRONMENT”) than the docs, which explicitly call out potential DB issues. I appreciate it’s difficult to do with intense jobs without running into other resource constraints.