Frequent unknown failures with UnhandledPromiseRejectionWarning

I was looking through my logs last night and noticed that quite a few of my sub-workflows are failing for an unknown reason. When I look at the debug output I see the following printed every time this happens:

2021-10-13T14:10:58.670Z | verbose  | Start external error workflow {"errorWorkflowId":"1","workflowId":38,"file":"WorkflowExecuteAdditionalData.js","function":"executeErrorWorkflow"}
(node:7) UnhandledPromiseRejectionWarning: Error: job stalled more than maxStalledCount
    at Queue.onFailed (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/job.js:516:18)
    at Queue.emit (events.js:315:20)
    at Queue.EventEmitter.emit (domain.js:467:12)
    at Redis.messageHandler (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/queue.js:444:14)
    at Redis.emit (events.js:315:20)
    at Redis.EventEmitter.emit (domain.js:467:12)
    at DataHandler.handleSubscriberReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:80:32)
    at DataHandler.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:47:18)
    at JavascriptRedisParser.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:21:22)
    at JavascriptRedisParser.execute (/usr/local/lib/node_modules/n8n/node_modules/redis-parser/lib/parser.js:544:14)
    at Socket.<anonymous> (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:25:20)
    at Socket.emit (events.js:315:20)
    at Socket.EventEmitter.emit (domain.js:467:12)
    at addChunk (internal/streams/readable.js:309:12)
    at readableAddChunk (internal/streams/readable.js:284:9)
    at Socket.Readable.push (internal/streams/readable.js:223:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
(node:7) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see (rejection id: 57)
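For reference, here is a minimal sketch (plain Node, not n8n code) of what that second warning means: a promise was rejected and nothing caught it. Installing a process-level handler turns the warning into an event you can log instead.

```javascript
// Build the log line for a rejection we intercepted.
function logRejection(reason) {
  return `unhandled rejection: ${reason.message}`;
}

// A process-level handler catches rejections that no .catch() handled,
// replacing Node's UnhandledPromiseRejectionWarning with our own logging.
process.on('unhandledRejection', (reason) => {
  console.log(logRejection(reason));
});

// Simulate the rejection Bull produces internally (error text from the log above).
Promise.reject(new Error('job stalled more than maxStalledCount'));
```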

Any ideas on how to resolve this?

Hey @willfore,

What version of n8n are you running and how do you have it set up? Does it seem to impact all workflows or just certain ones, and do they have anything in common?

I’m running 0.139.1 in queue mode on Docker. It seems to only impact sub-workflows; the parent workflow fails as well, of course, but it shows a failure reason. All other workflows run fine.

Hi @willfore, thanks for posting.

Sadly our queue-mode expert is out until the 25th, but I’ll assign him to this thread so that he can pick it up when he gets back.

Is there a way to change the value of maxStalledCount?

I have upgraded to 0.144.0 and am still having the same issues… it’s getting to the point now where most workflows don’t even run.

Is it possible that your workflows do something blocking?
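To illustrate what “blocking” means here (this is a general Node.js sketch, not a confirmed diagnosis of your workflow): a CPU-bound synchronous loop blocks Node’s event loop, so background work such as Bull renewing its job lock is starved, and the job can end up marked as stalled.

```javascript
// Busy-wait for the given number of milliseconds. While this loop runs,
// no timers or I/O callbacks can fire - the event loop is blocked.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* nothing else can run here */ }
}

let timerFired = false;
setTimeout(() => { timerFired = true; }, 0); // due "immediately"
blockFor(50);
// The timer was due right away, but it could not fire while we were blocking.
console.log('timer fired during block?', timerFired); // false
```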

If you want to read up on stalled jobs and when or why they happen, it is best to check out Bull, which we use internally.

Here, for example, is one place where they write about stalled jobs:
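For reference, `maxStalledCount` is one of Bull’s advanced queue settings. The names and defaults below are from Bull’s documentation; n8n creates the queue internally and does not currently expose these as options, so treat this as a sketch of Bull’s API rather than an n8n configuration.

```javascript
// Bull's advanced settings that govern stalled-job detection (defaults shown).
// A job counts as stalled when its lock expires without being renewed; after
// maxStalledCount stalls it is failed with the exact error seen in the log.
const settings = {
  lockDuration: 30000,    // ms a worker holds a job lock before it must renew
  stalledInterval: 30000, // ms between Bull's checks for stalled jobs
  maxStalledCount: 1,     // stalls tolerated before the job is failed
};

// Usage would be: new Queue('jobs', redisUrl, { settings }) - this needs a
// running Redis server, and in n8n's case the queue is created internally.
console.log(settings.maxStalledCount);
```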

I’m not sure what could be blocking… it’s a simple workflow and even has a 5 min timeout set on the entire workflow. It does make HTTP request calls, but they have a 10 sec timeout as well. Sometimes they run fine, other times not. When they fail, or get stuck, it’s always for the same reason.

Hey @willfore sorry about the issues you are having.

When your workflows get stuck, does the execution stay in Running status for long periods of time?

I know that you’ve set a timeout for the workflow’s execution, but in some situations n8n cannot enforce it.

Can you please detail how you set the timeouts? You can set a workflow timeout in the workflow settings, or you can set an HTTP Request timeout in the HTTP Request node.

With this I can try to better diagnose what is happening.

Hi @krynble Yes, they all say running for very long periods of time, and they are also hard to cancel and clear out. I have the timeout set on every workflow via Workflow → Settings. A few also have a 10 second timeout set on an HTTP Request node, but not all. Could that be causing the issue? I am also now running 0.145.

Thanks for the feedback! Yes, this can be an issue: the timeout you set via Workflow → Settings actually works by preventing a node’s execution before it starts, but if n8n is stuck inside a specific node and the timeout elapses during that execution, we cannot stop it when n8n is running in main mode.

There is one way to try and diagnose this. Are you running n8n on your own infrastructure? Did you change the default value of the EXECUTIONS_PROCESS setting? The default value is own, which allows n8n to stop executions correctly. If it is set to main, you may encounter the issue I mentioned above.
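For reference, the relevant lines in a docker-compose environment block would look like this (a sketch only; own is already the default, so the line just makes the setting explicit):

```yaml
    environment:
      - EXECUTIONS_MODE=queue
      # 'own' is the default; 'main' can prevent n8n from stopping stuck executions
      - EXECUTIONS_PROCESS=own
```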

I would also recommend setting the timeout on the HTTP Request node, as that forces a timeout at the node level. n8n’s setting is workflow-level and only acts between nodes.
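To illustrate why the per-node timeout matters, here is a sketch in plain Node (not n8n internals; `slowRequest` is a stand-in for a real HTTP call) of a request guarded by a timeout. The HTTP Request node’s timeout option has the same effect: the step can never hang indefinitely.

```javascript
// Race a promise against a timer, so whichever settles first wins.
// If the timer wins, the operation is abandoned with a timeout error.
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('request timed out')), ms)
    ),
  ]);
}

// Stand-in for a slow HTTP call that takes 500 ms to respond.
const slowRequest = new Promise((resolve) => setTimeout(resolve, 500, 'ok'));

withTimeout(slowRequest, 50).catch((err) => console.log(err.message));
// prints: request timed out
```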

Let me know if this helps!

Here is my docker config. The only thing that has helped so far is to stop running in queue mode. Maybe there is something missing in my config that you can point out?

    image: n8nio/n8n
    restart: always
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_USER=${postgres_user}
      - DB_POSTGRESDB_PASSWORD=${postgres_pass}
      - N8N_BASIC_AUTH_ACTIVE=true
      - GENERIC_TIMEZONE=America/Chicago
      - N8N_METRICS=true
      - EXECUTIONS_MODE=queue
    ports:
      - 5678:5678
    depends_on:
      - db_data
      - proxy
      - postgres
      - redis
    volumes:
      - n8n:/home/node/.n8n
    # Wait 15 seconds before starting n8n to make sure that PostgreSQL
    # is ready when n8n tries to connect to it
    command: /bin/sh -c "sleep 15; n8n start"

Hey @willfore

Thanks for sharing your configuration. It all seems to be fine there; nothing I can point to right now.

Another thing we can try is to diagnose where n8n stops running and hangs.

In your workflow settings, please set “Save execution progress” to “Yes”. With this setting on, n8n saves the result of each executed node to the database.

Workflows stuck in the running state can be inspected if you edit the URL to include the execution ID; for example, execution/107 shows you execution ID 107.

In this scenario, you can force-open the execution of a running workflow and, by refreshing your browser window, watch the progress as each node’s result is loaded from the database. This should allow you to pinpoint exactly which node is stuck.

I hope this also helps you find the problem. Also, feel free to share your workflow so we can see what’s happening. Please remember to remove any potentially sensitive data.