Frequent Unknown failures with Unhandled Promise Rejection Warning

I was looking through my logs last night and noticed that quite a few sub-workflows are failing for an unknown reason. In the debug output I see the following printed every time this happens:


2021-10-13T14:10:58.670Z | verbose  | Start external error workflow {"errorWorkflowId":"1","workflowId":38,"file":"WorkflowExecuteAdditionalData.js","function":"executeErrorWorkflow"},
(node:7) UnhandledPromiseRejectionWarning: Error: job stalled more than maxStalledCount,
    at Queue.onFailed (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/job.js:516:18),
    at Queue.emit (events.js:315:20),
    at Queue.EventEmitter.emit (domain.js:467:12),
    at Redis.messageHandler (/usr/local/lib/node_modules/n8n/node_modules/bull/lib/queue.js:444:14),
    at Redis.emit (events.js:315:20),
    at Redis.EventEmitter.emit (domain.js:467:12),
    at DataHandler.handleSubscriberReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:80:32),
    at DataHandler.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:47:18),
    at JavascriptRedisParser.returnReply (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:21:22),
    at JavascriptRedisParser.execute (/usr/local/lib/node_modules/n8n/node_modules/redis-parser/lib/parser.js:544:14),
    at Socket.<anonymous> (/usr/local/lib/node_modules/n8n/node_modules/ioredis/built/DataHandler.js:25:20),
    at Socket.emit (events.js:315:20),
    at Socket.EventEmitter.emit (domain.js:467:12),
    at addChunk (internal/streams/readable.js:309:12),
    at readableAddChunk (internal/streams/readable.js:284:9),
    at Socket.Readable.push (internal/streams/readable.js:223:10),
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23),
(node:7) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 57)

Any ideas on how to resolve this?

Hey @willfore,

What version of n8n are you running and how do you have it set up? Does it seem to impact all workflows or just certain ones, and do they have anything in common?

I’m running 0.139.1 in queue mode on docker. It seems to only impact the sub workflow and the parent workflow fails as well of course but it shows a failure reason. All other workflows run fine.

Hi @willfore, thanks for posting.

Sadly our queue-mode expert is out until the 25th, but I’ll assign him to this thread so that he can pick it up when he gets back.

Is there a way to change the value of maxStalledCount?

I have upgraded to 0.144.0 and am still having the same issues… it’s getting to the point now where most workflows don’t even run

Is it possible that your workflows do something blocking?

If you want to read up on stalled jobs, and when or why they happen, it is best to check out Bull, which we use internally.
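To make the error message above concrete, here is a minimal sketch of how a stalled-job check can lead to "job stalled more than maxStalledCount". It is a simplified model, not Bull's actual code: apart from maxStalledCount and the error text from the log, every name in it (Job, checkStalled, pickUp, lockExpiresAt) is hypothetical. The assumption is that a worker holds a lock on each active job and must renew it; if the lock expires, the job counts as stalled and is either re-queued or failed.

```typescript
// Simplified model of a stalled-job check. NOT Bull's implementation;
// all names except maxStalledCount are hypothetical.
interface Job {
  id: number;
  lockExpiresAt: number; // time (ms) until which a worker holds the job
  stalledCount: number;
  state: "active" | "waiting" | "failed";
  failedReason?: string;
}

// Run one pass of the stalled check at time `now`.
function checkStalled(jobs: Job[], now: number, maxStalledCount: number): void {
  for (const job of jobs) {
    // Only active jobs whose lock was not renewed in time count as stalled.
    if (job.state !== "active" || job.lockExpiresAt > now) continue;
    job.stalledCount += 1;
    if (job.stalledCount > maxStalledCount) {
      job.state = "failed";
      job.failedReason = "job stalled more than maxStalledCount";
    } else {
      job.state = "waiting"; // another worker may pick it up again
    }
  }
}

// A worker takes a waiting job and holds its lock until `lockExpiresAt`.
function pickUp(job: Job, lockExpiresAt: number): void {
  job.state = "active";
  job.lockExpiresAt = lockExpiresAt;
}

const first: Job = { id: 1, lockExpiresAt: 1000, stalledCount: 0, state: "active" };
const second: Job = { id: 2, lockExpiresAt: 9000, stalledCount: 0, state: "active" };
const jobs: Job[] = [first, second];

checkStalled(jobs, 5000, 1); // job 1 stalls once and is re-queued
pickUp(first, 6000);         // a worker takes it again but stops renewing the lock
checkStalled(jobs, 8000, 1); // second stall exceeds the limit of 1

const summary = jobs.map((j) => `${j.id}:${j.state}`).join(", ");
console.log(summary); // prints "1:failed, 2:active"
```

In this model a job can stall either because the worker is too busy to renew its lock or because the worker process died, which matches the two kinds of causes discussed in this thread.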

Here for example one place where they write about stalled jobs:

I’m not sure what could be blocking… it’s a simple workflow and even has a 5 min timeout set on the entire workflow. It does make HTTP request calls, but they have a 10 sec timeout as well. Sometimes they run fine, other times not. When they fail, or get stuck, it’s always due to the same reason.

Hey @willfore sorry about the issues you are having.

When your workflows get stuck, does the execution stay in Running status for long periods of time?

I know that you’ve set a timeout for the workflow execution, but in some situations n8n cannot enforce it.

Can you please detail how you set the timeouts? You can set a workflow timeout in the workflow settings, or you can set an HTTP Request timeout in the HTTP Request node.

With this I can try to better diagnose what is happening.

Hi @krynble Yes, they all say running for very long periods of time, and they are also hard to cancel and clear out. I have the timeout set on every workflow via Workflow → Settings. There are a few that also have a 10 second timeout set on an HTTP request, but not all. Could that be causing the issue? I am also now running 0.145

Thanks for the feedback! Yes, this can be an issue. The timeout that you set via Workflow → Settings works by preventing a node from starting once the time is up. If n8n is stuck inside a specific node when the timeout elapses, we cannot stop it when n8n is running in main mode.

There is one way to try and diagnose this. Are you running n8n on your own infrastructure? Did you change the default value for the EXECUTIONS_PROCESS setting? The default value is own which allows n8n to stop executions correctly. If set to main then you may encounter the issue I mentioned above.

I would also recommend setting the timeout on the HTTP Request node, as it forces a timeout at the node level. n8n’s workflow setting applies at the workflow level and only acts in between nodes.
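The difference between the two timeouts can be sketched with a toy model that uses a virtual clock. This is only an illustration of the behavior described above for main mode, not n8n's code; SimNode, simulate and the field names are hypothetical, and the sketch assumes that a node which hits its own timeout ends the run.

```typescript
// Toy model with a virtual clock. The names are hypothetical, not n8n's API.
interface SimNode {
  label: string;
  durationMs: number;     // how long the node would take if left alone
  nodeTimeoutMs?: number; // optional per-node timeout (e.g. on an HTTP request)
}

interface SimResult {
  log: string[];
  elapsedMs: number;
}

function simulate(nodes: SimNode[], workflowTimeoutMs: number): SimResult {
  const log: string[] = [];
  let clock = 0;
  for (const node of nodes) {
    // The workflow-level timeout is only checked before a node starts.
    if (clock >= workflowTimeoutMs) {
      log.push(`workflow timeout before ${node.label}`);
      break;
    }
    // A per-node timeout can cut a slow node short while it is running.
    if (node.nodeTimeoutMs !== undefined && node.durationMs > node.nodeTimeoutMs) {
      clock += node.nodeTimeoutMs;
      log.push(`${node.label} aborted by node timeout`);
      break; // assumption: a failed node ends the run
    }
    clock += node.durationMs;
    log.push(`${node.label} finished`);
  }
  return { log, elapsedMs: clock };
}

const FIVE_MIN = 5 * 60 * 1000;

// A request that hangs for 10 minutes, followed by a quick node.
const unguarded = simulate(
  [{ label: "http", durationMs: 600000 }, { label: "save", durationMs: 100 }],
  FIVE_MIN,
);
// The same request with a 10 second node-level timeout.
const guarded = simulate(
  [{ label: "http", durationMs: 600000, nodeTimeoutMs: 10000 }, { label: "save", durationMs: 100 }],
  FIVE_MIN,
);

console.log(unguarded.elapsedMs, unguarded.log.join("; "));
// prints: 600000 http finished; workflow timeout before save
console.log(guarded.elapsedMs, guarded.log.join("; "));
// prints: 10000 http aborted by node timeout
```

Without the node-level timeout, the 5 minute workflow timeout only takes effect after the stuck node has run for its full 10 minutes.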

Let me know if this helps!

Here is my docker config. The only thing that has helped so far is to stop running in queue mode. Maybe there is something missing in my config that you can point out?

n8n:
    image: n8nio/n8n
    restart: always
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=${postgres_user}
      - DB_POSTGRESDB_PASSWORD=${postgres_pass}
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=${N8N_BASIC_AUTH_USER}
      - N8N_BASIC_AUTH_PASSWORD=${N8N_BASIC_AUTH_PASSWORD}
      - N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
      - GENERIC_TIMEZONE=America/Chicago
      - WEBHOOK_TUNNEL_URL=n8n.cpirt.io
      - N8N_METRICS=true
      - QUEUE_BULL_REDIS_HOST=redis
      - QUEUE_BULL_REDIS_PORT=6379
      - EXECUTIONS_MODE=queue
    ports:
      - 5678:5678
    networks:
      - db_data
      - proxy
    depends_on:
      - postgres
      - redis
    volumes:
      - n8n:/home/node/.n8n
    # Wait 15 seconds before starting n8n to make sure that PostgreSQL is ready
    # when n8n tries to connect to it
    command: /bin/sh -c "sleep 15; n8n start"

Hey @willfore

Thanks for sharing your configuration. It all seems to be in order. There is nothing I can point to right now.

Another thing we can try is to diagnose where n8n stops running and hangs up.

In your workflow settings, please set “Save execution progress” to “Yes”. With this setting on, n8n saves the result of each executed node to the database.

Workflows stuck in the running state can be viewed if you edit the URL to include the execution ID. For example, execution/107 shows you execution ID 107.

In this scenario, you can force open the execution of a running workflow and by refreshing your browser window you can see the progress, as it gets loaded from the database for each node executed. This would allow you to pinpoint exactly what node is stuck.

I hope this also helps you find the problem. Also, feel free to share your workflow so we can see what’s happening. Please remember to remove any possibly sensitive data.

I have the same issue with a workflow that only makes PostgreSQL queries using a few PostgreSQL nodes. Looking at the Bull documentation, it seems the more likely reason for this error is this one:

  1. The Node process running your job processor unexpectedly terminates.

I say this because those nodes are not CPU intensive and should not block the CPU.

In our case we have deployed n8n workers in several containers using Kubernetes, so I looked at the upscale/downscale behavior and the issue correlates with a downscale of containers:

[Screenshot: Screen Shot 2022-02-22 at 10.26.54 AM]

This makes me believe that maybe the n8n workers do not properly handle the SIGTERM signal from the pod and the worker process is halted abruptly. Could this be possible @krynble ?

Hey @Miguel_Caballero_Pin

Interesting. In theory, n8n receives the SIGTERM and intercepts it, then tries to shut down gracefully, allowing 30 seconds for running executions to finish.

Is there a chance your executions take more than 30 seconds?

You can see the startup and shutdown code by browsing the following file: n8n/worker.ts at master · n8n-io/n8n · GitHub

You can see on line 227 where we register a shutdown function that is declared on line 72. It first stops receiving new requests and then waits for current executions to finish.
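The two-step shutdown described above (stop accepting work, then wait up to a grace period) can be sketched roughly as follows. This is not the code from worker.ts; QueueWorker, start, shutdown and sleep are hypothetical names, and in a real worker the shutdown call would be wired to the SIGTERM handler, which is only shown here as a comment.

```typescript
// Hypothetical worker model; none of these names come from n8n's worker.ts.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

class QueueWorker {
  private accepting = true;
  private running = new Set<Promise<void>>();

  // Start a job unless shutdown has begun. Returns false if the job is refused.
  start(job: () => Promise<void>): boolean {
    if (!this.accepting) return false;
    const tracked: Promise<void> = job().then(
      () => { this.running.delete(tracked); },
      () => { this.running.delete(tracked); },
    );
    this.running.add(tracked);
    return true;
  }

  // Step 1: stop taking new jobs. Step 2: wait for running jobs, up to graceMs.
  async shutdown(graceMs: number): Promise<"drained" | "forced"> {
    this.accepting = false;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const graceOver = new Promise<"forced">((resolve) => {
      timer = setTimeout(() => resolve("forced"), graceMs);
    });
    const drained = Promise.all(Array.from(this.running)).then(() => "drained" as const);
    const outcome = await Promise.race([drained, graceOver]);
    clearTimeout(timer);
    return outcome;
  }
}

// In a real process this would be triggered by the signal, for example:
//   process.on("SIGTERM", () => worker.shutdown(30000).then(() => process.exit(0)));

async function demo(): Promise<string[]> {
  const quick = new QueueWorker();
  quick.start(() => sleep(10));               // finishes well within the grace period
  const a = await quick.shutdown(200);

  const slow = new QueueWorker();
  slow.start(() => sleep(300));               // outlives the grace period
  const b = await slow.shutdown(50);
  const accepted = slow.start(() => sleep(1)); // refused after shutdown began

  return [a, b, String(accepted)];
}

demo().then((r) => console.log(r.join(", "))); // prints "drained, forced, false"
```

The "forced" case is the one that matters for this thread: an execution that runs longer than the grace period is cut off when the process exits, which the queue would later see as a stalled job.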

Does your Kubernetes cluster allow some time for the pods to shut down? What is the timeout it provides?

Is there a chance your executions take more than 30 seconds?

Yes, those workflows take minutes to run.

Does your Kubernetes cluster allow some time for the pods to shut down? What is the timeout it provides?

That is a good point. We are not setting it explicitly, which means it defaults to 30 seconds, the same as what you have set up. It seems we do not have many options on our end. n8n could provide a parameter to set a custom termination grace period so we can increase it on both sides, n8n and Kubernetes.

Something that may help also is being able to retry those jobs. Currently when a failure like that one happens jobs are not retryable.

Hey Miguel.

Unfortunately the timer is currently not configurable, it is hardcoded to 30 seconds. If you’re feeling like contributing, feel free to create a PR :slight_smile:

Regarding the retry issue, this is def a bug and will be added to our to-do list.

Hi @krynble, is there any news on the retry capability?

In my case, I use n8n in Kubernetes and I want to use KEDA to autoscale the workers, but I need a way to retry jobs when a worker is stopped before the job finishes.

Hey @sGendrot, I’ve created a PR that lets you configure how long n8n waits for running executions before exiting.

You can see the PR here. It should be released soon.

Please note that if a worker process is still unable to finish the execution in time, the execution stays locked for a while, but then another worker is able to pick it up, so you never lose that information.
