Queue mode error from workers

Describe the problem/error/question

I recently implemented queue mode in my Kubernetes cluster, which has plenty of resources available. I anticipated this would address some ongoing issues I had been having. I'm not entirely sure what the expected behavior is, but these are some of my observations/concerns so far:

  1. Delayed Feedback from Worker Nodes: There’s a noticeable lag in the worker nodes reporting back to the main instance. The execution history initially shows no activity, but then updates to display completed tasks abruptly, without any indication of them running in real-time.
  2. Inconsistent Workers Screen Display: Accessing the Workers section under settings sometimes shows no data. Extremely intermittent.
  3. Error in Logs: In the Google Cloud logs for the workers, I frequently encounter the following error: TypeError: Cannot read properties of undefined (reading 'publishToWorkerChannel'), originating from { file: 'LoggerProxy.js', function: 'exports.error' }. This seems to suggest an issue with message passing or event handling.
  4. Unexpected Failures in Worker Nodes: Since the queue mode setup, there's been an uptick in random failures from the worker nodes, diverging from their previous stability. When I say random, I mean that before queue mode I would see failures in a particular workflow at a particular node. Now that same workflow will just fail at random nodes, complaining about being out of memory (OOM).

I’ve attempted to mitigate these issues by increasing the memory allocation for the worker nodes, but it hasn’t had the desired effect. Any insights into whether these behaviors are typical or indicative of underlying problems would be greatly appreciated. Guidance on how to adjust or what to investigate further would also be helpful.
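
For reference, the memory bump was along these lines; the deployment name and sizes below are placeholders rather than my exact values:

```sh
# illustrative only - worker deployment name and sizes are placeholders
kubectl set resources deployment/n8n-worker \
  --requests=memory=4Gi --limits=memory=8Gi
```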

What is the error message (if any)?

TypeError: Cannot read properties of undefined (reading 'publishToWorkerChannel') "{ file: 'LoggerProxy.js', function: 'exports.error' }"

Please share your workflow

Not really a workflow to share as this is a queue mode specific issue. But I can make one if you guys really need one.

(Select the nodes on your canvas and use the keyboard shortcuts CMD+C/CTRL+C and CMD+V/CTRL+V to copy and paste the workflow.)

Share the output returned by the last node

Information on your n8n setup

  • n8n version: self-hosted latest and greatest
  • Database (default: SQLite): Postgres
  • n8n EXECUTIONS_PROCESS setting (default: own, main): queue
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Gcloud in a kubernetes cluster
  • Operating system:

Hey @tipsykat,

For the version you have put "self-hosted latest and greatest", can you be more specific? We release weekly and also have a `latest` Docker tag, so this could be any version really.

To run through the issues quickly…

  1. I would expect there to be some kind of delay, as the worker doesn't talk to the main instance directly; it writes its updates to Redis and the database, and the main instance picks them up from there. Having said that, in my own queue mode instance I don't see any noticeable delay. What sort of delay are you seeing, ideally in seconds?

  2. The workers screen did have an issue a few versions back, but that was resolved. I believe this view gets its information from Redis rather than from the workers directly, so based on this issue and the first one it sounds like there could be a problem talking to Redis.

  3. This further suggests there could be an issue with the communication to Redis.

  4. If it is failing with OOM errors I would check the worker logs to confirm. It could be that it has run out of memory, that CPU resources are not available, or that there is an issue in the workflow execution itself. The log will normally have the extra information; if not, try enabling debug logging.
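
To turn on debug logging you can set the log level option on the worker, something like this:

```sh
# set on the worker container; console is already the default output, shown here for clarity
N8N_LOG_LEVEL=debug
N8N_LOG_OUTPUT=console
```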

Based on the above I would start by looking into the Redis instance, making sure there are no routing issues and that it has enough resources.
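
A quick way to rule out basic connectivity from inside the cluster is something like this; the host, port and password are placeholders for your own values:

```sh
# throwaway pod with redis-cli; add -a <password> if your Redis requires auth
kubectl run redis-check --rm -it --image=redis:7 -- \
  redis-cli -h <redis-host> -p 6379 ping
# a reachable, healthy Redis replies with PONG
```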

Jon,

Appreciate your response. I initially reported issues on version 1.28, but I’ve since upgraded to the latest, now at 1.31.

Here’s a quick rundown:

  • Delay in Execution List: I’m seeing about a 10-second lag for jobs from workers to appear in the execution list, even though these workflows are quick, finishing in 3-4 seconds. Adjusting QUEUE_RECOVERY_INTERVAL to 15000 didn’t seem to alter this behavior.
  • Cloud vs. Self-hosted Redis: My setup uses the cloud version of Redis. Considering the ongoing issues, I’m beginning to question if this might be contributing to the problem. Do you know if any of this has been tested with the cloud version?
  • Workers Screen Issue Post-Upgrade: After the upgrade to 1.31, the workers’ screen is stuck on a loading spinner without loading any data at all. I’m going to test the addition of a self-hosted Redis to my cluster as a potential fix.
  • OOM Error Resolution: The previously mentioned OOM issue was resolved by addressing an endpoint response that exceeded the 15 MB limit, which I corrected by increasing the N8N_PAYLOAD_SIZE_MAX variable (both tweaks are shown in the snippet after this list). This has led me to believe the OOM errors might be more of a generic, catch-all error message than specifically indicative of memory issues. It would be really helpful to bubble the actual error up into the workflow execution log; is there a reason the application doesn't do that?
  • Configuration Clarification Needed: I've duplicated the main instance's configuration into the worker deployments, specifically the settings for the Postgres database, Redis instance, and encryption key, as per the documentation's recommendations. Additionally, I've included resource limits and persistent volume configurations in the worker deployment. There's also a lot I didn't copy over.
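
For clarity, these are the two environment tweaks referenced above; the payload value below is illustrative rather than my exact number:

```sh
# queue/payload tweaks on the n8n containers; payload value is illustrative
QUEUE_RECOVERY_INTERVAL=15000   # didn't change the ~10 second delay
N8N_PAYLOAD_SIZE_MAX=64         # max payload in MB, raised to clear the oversized endpoint response
```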

This configuration duplication is where most of my confusion comes from – the documentation specifically says that worker and main instances must have the same "access" to the database and Redis, but what does "access" entail? Does it merely mean network access within a distributed setup? Given that all nodes in my GKE cluster can reach everything the main instance can, I'm not sure what impact replicating or omitting specific environment variables in the workers' configuration has on the setup.

The documentation is pretty ambiguous on whether workers inherently inherit certain main instance configuration settings or if additional explicit configuration beyond the encryption key is required.

I think adding a detailed Docker example to the docs, with the appropriate environment variables for running workers, would help a lot to clear some of this up.
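
Something along these lines is what I have in mind; this is purely illustrative, based on my current understanding of which variables a worker needs, not a verified example:

```sh
# hostnames, credentials, and the image tag are placeholders
docker run -d --name n8n-worker \
  -e EXECUTIONS_MODE=queue \
  -e QUEUE_BULL_REDIS_HOST=<redis-host> \
  -e QUEUE_BULL_REDIS_PORT=6379 \
  -e DB_TYPE=postgresdb \
  -e DB_POSTGRESDB_HOST=<postgres-host> \
  -e DB_POSTGRESDB_DATABASE=n8n \
  -e DB_POSTGRESDB_USER=n8n \
  -e DB_POSTGRESDB_PASSWORD=<password> \
  -e N8N_ENCRYPTION_KEY=<same-key-as-main> \
  n8nio/n8n:<version> worker
```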

I set up a Redis instance in my GKE cluster and I'm still getting this error, and tons of jobs are failing.
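
For reference, the in-cluster Redis was nothing fancy, roughly this (chart and values simplified from what I actually ran):

```sh
# standalone Redis via the Bitnami chart; values simplified
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --set architecture=standalone
```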

This is all I am seeing in the logs:
2024-03-10T01:35:25.882Z | error | TypeError: Cannot read properties of undefined (reading 'publishToWorkerChannel') "{ file: 'LoggerProxy.js', function: 'exports.error' }"

When I run them in the main instance, they run with no problem at all. Plenty of resources to go around for everyone as I am running some pretty beefy hardware right now.

Hey @tipsykat,

Don't forget that when you run workflows from the main instance they aren't going to use the workers or Redis.

Are you using a Google-managed instance of Redis or just running it as a container? In theory Redis is Redis, but it would be worth making sure it is actually Redis and not a Redis-compatible service.

The workers only really need the database settings to read the workflows, the Redis settings for the queue, the encryption key, and any options relating to workflows such as node blocking or which npm packages are allowed. Duplicating env options won't cause any issues, so you don't need to worry about that.

The logs will show workflow-specific errors, but the UI will mostly show the node errors, so both should be checked if there are issues; this may change in the future.

Jon,

I'm using a managed Redis instance from Redis themselves, but I believe it was deployed to Google Cloud, just not my project. The company signed up for it directly through Redis' website.

Also, I did try a local instance of Redis in the cluster, and that didn't seem to help either. However, I didn't give it long to run because I was seeing a lot of red while it was on.

I have to get a dev instance of n8n stood up before I'll mess with the queue again. My original issue was definitely all payload size and has been resolved, and with it most of my execution issues. The new hardware upgrade I got approved because of the queue issues I was experiencing doesn't hurt either :slight_smile:

I am going to go ahead and close this for now, once I am back in queue mode I will post for help if these pop back up again.

Thanks for the help!


Hey @tipsykat,

Sounds good to me.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.