[Queue Mode] Worker stops consuming new messages

Hi,

I’m using queue mode in n8n, running the main server and the worker as separate pods via Docker/Kubernetes.
I have noticed that after some time (a few hours, sometimes a day; it’s not deterministic) the worker pod (Bull) stops consuming new messages, even though jobs do exist in the queue (I have checked in Redis). When I restart the worker pod, it starts consuming those jobs again. There are no errors in the ‘error’ event, but sometimes, after a few more hours, I observe the error below in my worker. Once this error is received, the worker starts consuming messages again:

 Error from queue:  Error: read ECONNRESET
    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

This is how I’m initialising Bull and registering the process handler in worker.ts:

Worker.jobQueue = new Bull(jobName, {
    prefix,
    redis: redisOptions,
    enableReadyCheck: false,
    settings: { maxStalledCount: 30 },
});

Worker.jobQueue.process(flags.concurrency, async (job) => this.runJob(job));
...

async runJob(job: Bull.Job): Promise<IBullJobResponse> {
    // some code
    return {
        success: true,
    };
}
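
For reference, the ‘error’ listener mentioned above is wired up roughly like this (a minimal sketch that assumes the Worker.jobQueue instance created above; the log prefix matches the output shown earlier):

// Minimal sketch: log anything Bull emits on the queue's 'error' event.
Worker.jobQueue.on('error', (error: Error) => {
    console.error('Error from queue: ', error);
});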

Since I’m not receiving any error event when the worker stops consuming new messages, it’s hard to debug. Kindly let me know what could possibly trigger this issue and how I can fix it.

Important note: we are not using the latest n8n version. Instead, we have pulled in the relevant code for queue mode and all related changes. Also, it works fine when I run n8n’s queue mode directly via npm locally (./packages/cli/bin/n8n worker), but I face the issue only when running via Docker/Kubernetes, which is how we run n8n and need to keep running it.

Versions in use:

  • bull: 4.10.2
  • ioredis: 5.2.4
  • Node.js: 14.15

@krynble, can you perhaps help with this one? Thank you :pray:

I have opened an issue with Bull and have added more details there as I’ve been able to narrow down the issue. Can you please look into it as well: Bull process stops consuming new messages · Issue #2612 · OptimalBits/bull · GitHub?

Thanks.

Hey @Shahtaj_Khalid, thanks for reporting this.

Have you also reproduced this part of the code (the queue’s ‘error’ event listener) in your version?

Also, it might be useful to set N8N_LOG_LEVEL to debug; maybe you’ll get some more log messages.

This is an odd error and odd behaviour indeed, as a crash when connecting to Redis should cause the worker to exit, and k8s would then restart your pod. This looks like a network issue.

I hope the above considerations help you.

Yes, I have also implemented that error listener in our code, and that is what catches this error after a few hours, not right away. But while this is happening, the process method remains on hold and no new messages get consumed:

Error from queue:  Error: read ECONNRESET
    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

I tracked via Redis MONITOR that the command (BRPOPLPUSH) which Bull sends internally on every process tick doesn’t get issued anymore during that timeframe. Details added here: Bull process stops consuming new messages · Issue #2612 · OptimalBits/bull · GitHub.
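
If anyone wants to reproduce that check programmatically rather than with redis-cli, a rough sketch using ioredis’ MONITOR support could look like the following (connection options are placeholders, not from the original setup):

import Redis from 'ioredis';

// Rough sketch: stream MONITOR output and log every BRPOPLPUSH that Bull issues.
// Host/port are placeholders; point them at the Redis instance the worker uses.
const redis = new Redis({ host: 'redis', port: 6379 });

async function watchBrpoplpush(): Promise<void> {
    const monitor = await redis.monitor();
    monitor.on('monitor', (time: string, args: string[]) => {
        if (args[0] && args[0].toLowerCase() === 'brpoplpush') {
            console.log(time, 'BRPOPLPUSH issued:', args.slice(1).join(' '));
        }
    });
}

watchBrpoplpush().catch(console.error);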

Also, with a network issue, Redis should still recover, right? Any other request to Redis continues to work fine after that; only the process method stops. The odd bit is that it works fine when running via npm but encounters the issue when running via Docker.

Please let me know if you need more info on this to help find the issue. Thanks.

Hey @Shahtaj_Khalid, yes, Redis should recover automatically, but this does not seem to be happening. If you’re running a setup that allows container recreation, maybe you can just forcefully exit (by calling process.exit(1)) and then the worker would restart. Not an ideal solution, I know, but given it’s not something we can reliably reproduce, it’s hard for us to fix.
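
For anyone hitting the same thing, that workaround could look roughly like this (a minimal sketch only; which error codes should be treated as fatal is an assumption, not an official recommendation):

// Sketch of the suggested workaround: exit on connection-level queue errors
// so Kubernetes recreates the worker pod instead of leaving it stalled.
Worker.jobQueue.on('error', (error: NodeJS.ErrnoException) => {
    console.error('Error from queue: ', error);
    if (error.code === 'ECONNRESET' || error.code === 'ECONNREFUSED') {
        process.exit(1); // let k8s restart the pod
    }
});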

