Frequent Unknown failures with Unhandled Promise Rejection Warning

willfore · October 13, 2021, 2:15pm

Looking through my logs last night and noticed that I am seeing quite a bit of sub workflows failing due to unknown reason. When looking at the debugs I see the following printed every time this happens:


2021-10-13T14:10:58.670Z | verbose  | Start external error workflow {"errorWorkflowId":"1","workflowId":38,"file":"WorkflowExecuteAdditionalData.js","function":"executeErrorWorkflow"},
(node:7) UnhandledPromiseRejectionWarning: Error: job stalled more than maxStalledCount,
    at Queue.onFailed (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/job.js:516:18),
    at Queue.emit (events.js:315:20),
    at Queue.EventEmitter.emit (domain.js:467:12),
    at Redis.messageHandler (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/queue.js:444:14),
    at Redis.emit (events.js:315:20),
    at Redis.EventEmitter.emit (domain.js:467:12),
    at DataHandler.handleSubscriberReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:80:32),
    at DataHandler.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:47:18),
    at JavascriptRedisParser.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:21:22),
    at JavascriptRedisParser.execute (/usr/local/lib/node_modules/n8n/node_modules/redis-parser/lib/parser.js:544:14),
    at Socket.<anonymous> (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:25:20),
    at Socket.emit (events.js:315:20),
    at Socket.EventEmitter.emit (domain.js:467:12),
    at addChunk (internal/streams/readable.js:309:12),
    at readableAddChunk (internal/streams/readable.js:284:9),
    at Socket.Readable.push (internal/streams/readable.js:223:10),
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23),
(node:7) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 57)

Any ideas on how to resolve this?

Jon · October 14, 2021, 5:16am

Hey @willfore,

What version of n8n are you running and how do you have it setup? Does it seem to impact all workflows or just certain ones and do they have anything in common?

willfore · October 14, 2021, 1:15pm

I’m running 0.139.1 in queue mode on docker. It seems to only impact the sub workflow and the parent workflow fails as well of course but it shows a failure reason. All other workflows run fine.

sirdavidoff · October 15, 2021, 8:04am

Hi @willfore, thanks for posting.

Sadly our queue-mode expert is out until the 25th, but I’ll assign him to this thread so that he can pick it up when he gets back.

willfore · October 15, 2021, 2:58pm

Is there a way to change the value of maxStalledCount?

willfore · October 21, 2021, 2:59pm

I have upgraded to 0.144.0 and still having the same issues… it’s getting to the point now where most workflows don’t event run

jan · October 21, 2021, 6:52pm

Is it possible that your workflows do something blocking?

If you want to read up about stalled jobs, and when or why that happens it is best to check out bull which we use internally.

Here for example one place where they write about stalled jobs:

willfore · October 21, 2021, 8:40pm

I’m not sure what could be blocking… it’s a simple workflow and event has a 5 min timeout set on the entire workflow. It does make HTTP request calls but they have a 10sec timeout as well. Sometimes they run fine. Other times not. When they fail, or get stuck, it’s always due to the same reason.

krynble · October 26, 2021, 11:02am

Hey @willfore sorry about the issues you are having.

When your workflows get stuck, does the execution stay in Running status for long periods of time?

I know that you’ve set a timeout for the workflows execution, but depending on some situations, n8n cannot follow it.

Can you please detail how did you set the timeouts? You can set a workflow timeout, in the workflow settings, or you can set an HTTP Request timeout in the HTTP Request node.

With this I can try to better diagnose what is happening.

willfore · October 26, 2021, 1:22pm

Hi @krynble Yes, they all say running for very long periods of time, they are also hard to cancel and clear out. I have the timeout set on everyworkflow via workflow → settings. There are a few that also have a timeout set for 10 seconds on a HTTP requeest but not all. Could that be causing the issue? I am also now running 0.145

krynble · October 27, 2021, 8:02am

Thanks for the feedback! Yes, this can be an issue as n8n’s timeout that you set via Workflow → Settings actually works by preventing the execution of a node before it starts, but if n8n is stuck in a specific node and the timeout happens during this execution, we cannot stop it when n8n is running in main mode.

There is one way to try and diagnose this. Are you running n8n on your own infrastructure? Did you change the default value for the EXECUTIONS_PROCESS setting? The default value is own which allows n8n to stop executions correctly. If set to main then you may encounter the issue I mentioned above.

I would also recommend setting the timeout for the http request node as it forces a timeout on a per node level. n8n’s settings is workflow level-like and only acts in between nodes.

Let me know if this helps!

willfore · November 1, 2021, 4:12pm

Here is my docker config. The only thing that has helped so far is to stop running in queue mode. Maybe there is something missing in my config that you can point out?

n8n:
    image: n8nio/n8n
    restart: always
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=${postgres_user}
      - DB_POSTGRESDB_PASSWORD=${postgres_pass}
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=${N8N_BASIC_AUTH_USER}
      - N8N_BASIC_AUTH_PASSWORD=${N8N_BASIC_AUTH_PASSWORD}
      - N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
      - GENERIC_TIMEZONE=America/Chicago
      - WEBHOOK_TUNNEL_URL=n8n.cpirt.io
      - N8N_METRICS=true
      - QUEUE_BULL_REDIS_HOST=redis
      - QUEUE_BULL_REDIS_PORT=6379
      - EXECUTIONS_MODE=queue
    ports:
      - 5678:5678
    networks:
      - db_data
      - proxy
    depends_on:
      - postgres
      - redis
    volumes:
      - n8n:/home/node/.n8n
    # Wait 5 seconds to start n8n to make sure that PostgreSQL is ready
    # when n8n tries to connect to it
    command: /bin/sh -c "sleep 15; n8n start"

krynble · November 3, 2021, 8:25am

Hey @willfore

Thanks for sharing your configuration. It seems to be all right there. Nothing I can point right now.

Another thing we can try is to diagnose where n8n stops running and hangs up.

In your workflow settings, please activate “Save execution progress” to “Yes”. With this setting on, n8n saves to database the result of each node executed.

Workflows stuck in running state can be seen if you tamper the URL by informing the execution ID, like execution/107 shows you execution ID 107.

In this scenario, you can force open the execution of a running workflow and by refreshing your browser window you can see the progress, as it gets loaded from the database for each node executed. This would allow you to pinpoint exactly what node is stuck.

I hope this also helps you finding the problem. Also, feel free to share your workflow so we can see what’s happening. Please remember to remove any possibly sensitive data.

Miguel_Caballero_Pin · February 22, 2022, 6:28pm

I have the same issue with a workflow that only makes PostgreSQL queries using a few PostgreSQL nodes. Looking at the Bull documentation, it seems like reason for this error may be more likely this one:

Blockquote

The Node process running your job processor unexpectedly terminates.

Since those nodes are not CPU intensive and should not block the CPU.

In our case we have deployed n8n workers in several containers using Kubernetes, so I looked at the upscale/downscale behavior and the issue correlates with a downscale of containers:

Screen Shot 2022-02-22 at 10.26.54 AM

This makes me believe that maybe the n8n workers do not properly handle the SIGTERM signal from the pod and the worker process is halted abruptly. Could this be possible @krynble ?

krynble · February 24, 2022, 9:06am

Hey @Miguel_Caballero_Pin

Interesting, theoretically n8n receives the SIGTERM and intercepts it, trying to gracefully shutdown, allowing 30 seconds for running executions to finish.

Is there a chance your executions take more than 30 seconds?

You can see the startup and shutdown code by browsing the following file: n8n/worker.ts at master · n8n-io/n8n · GitHub

You can see on line 227 where we register a shutdown function that is declared on line 72. It first stops receiving new requests and then enters the waiting for current executions.

Does your Kubernetes cluster allow some time for the pods to shut down? What is the timeout it provides?

Miguel_Caballero_Pin · March 1, 2022, 11:58pm

Is there a chance your executions take more than 30 seconds?

Yes, those workflows take minutes to run.

Does your Kubernetes cluster allow some time for the pods to shut down? What is the timeout it provides?

That is a good point. We are not setting it up explicitly, which means it is defaulting to 30 seconds, same as what you have set up. It seems we do not have much options on our end. n8n could provide a parameter to set a custom termination grace period so we can increase it on both sides, n8n and kubernetes.

Miguel_Caballero_Pin · March 2, 2022, 4:05am

Something that may help also is being able to retry those jobs. Currently when a failure like that one happens jobs are not retryable.

krynble · March 3, 2022, 9:42am

Hey Miguel.

Unfortunately the timer is currently not configurable, it is hardcoded to 30 seconds. If you’re feeling like contributing, feel free to create a PR

Regarding the retry issue, this is def a bug and will be added to our to-do list.

sGendrot · August 9, 2022, 12:52pm

Hi @krynble , I have any news on the retry capacity ?

In my case, I use N8N in Kubernetes and I want to use KEDA to autoscale the worker, but I need a way to retry jobs when workers were stopped before the end of the job.

krynble · September 2, 2022, 1:06pm

Hey @sGendrot I’ve created a PR allowing you to set up the time n8n waits for running execution before exiting.

You can see the PR here . Should be released soon.

Please note that if it still happens that a worker process is unable to finish the execution in time, it stays locked for a while but then another worker is able to pick it up, so you’re never losing that information.