Describe the problem/error/question
I have a queue mode setup on an n8n instance using Postgres, Redis and two workers (one cloud based, one local). My instance was working extremely well until I updated all instances to the latest version. The previous update was conducted over a month ago, in December.
I'm currently experiencing webhooks that stop triggering after a while, sometimes after hours, sometimes after a day. The same holds true for the error workflows. When I turn the workflow off and on again the issue is resolved for the time being. I hope this is something others have encountered and there's a fix.
What is the error message (if any)?
No error messages found.
Please share your workflow
Holds true for all workflows.
Information on your n8n setup
- **n8n version:** latest
- **Database (default: SQLite):** SQLite
- **n8n EXECUTIONS_PROCESS setting (default: own, main):** default
- **Running n8n via (Docker, npm, n8n cloud, desktop app):** Docker
- **Operating system:** Railway
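For context, a queue-mode setup like the one described above is driven by environment variables on the main instance and each worker. This is only a minimal sketch; the host names and credentials below are placeholders, not the poster's actual configuration:

```shell
# Hedged sketch of a queue-mode environment (placeholder hosts/credentials).
export EXECUTIONS_MODE=queue              # enable queue mode
export DB_TYPE=postgresdb                 # queue mode needs Postgres/MySQL, not SQLite
export DB_POSTGRESDB_HOST=postgres.example.internal
export DB_POSTGRESDB_DATABASE=n8n
export DB_POSTGRESDB_USER=n8n
export DB_POSTGRESDB_PASSWORD=change-me
export QUEUE_BULL_REDIS_HOST=redis.example.internal
export QUEUE_BULL_REDIS_PORT=6379

# Main instance:
n8n start
# Each worker (same environment, and ideally the same n8n version as the main instance):
n8n worker
```

The same variables apply whether the processes run locally or in Docker containers.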
Welcome to the community
Can you explain your setup in a bit more detail? Are your workers just normal workers or are they webhook workers? This would help work out if the issue is potentially with the main instance, the load balancer or the workers.
I notice you have put the version down as `latest`. Can you tell us what version of n8n you are actually using, as `latest` is just a label and the version behind it is updated weekly? You have also listed your database as `sqlite`, which won't work for queue mode; are you using MySQL or Postgres for the database?
I'm currently on 1.22.5. I did some more searching and found that one of the workers had failed to upgrade. I suspect my workflows contained nodes that weren't available on the one worker that was left behind on an older version (1.17). That worker node crashed a while ago, and I expected the setup to be "self-healing" (the other worker node takes over). As of today all nodes are on the same version, and the only issue I'm still experiencing is error workflows not triggering on failed executions.
Could you comment on the self-healing part? I was under the assumption that the setup would continue processing as expected, but when the one node (the Docker container running n8n) went down, some webhook events stopped processing. Apart from this small hiccup, n8n and the team keep surprising me; I'm extremely happy with how everything is documented and how smoothly everything sails.
If a worker is on an old version it is still going to connect to the database and the Redis queue to try and pull jobs. If it crashes, the other worker would pick up the job, but the crashed worker would restart itself and the process would continue. The workers don't operate in an active/passive style; they are both active and take jobs in turn from the queue. If they did operate in an active/passive style, one stopping would let the other take over, but that wouldn't be a great approach for distributing the load.
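The active/active pull described above can be illustrated with a small simulation. This is a hypothetical sketch, not n8n internals: jobs sit in one shared queue and every running worker simply takes the next job in turn, so losing a worker never requires a failover handshake.

```python
from collections import deque

# Hypothetical simulation (not n8n code): a shared job queue drained
# by two active workers taking jobs in turn.
jobs = deque(f"job-{i}" for i in range(4))
done = {"worker-a": [], "worker-b": []}

while jobs:
    for worker in ("worker-a", "worker-b"):
        if jobs:
            done[worker].append(jobs.popleft())

# Both active workers shared the load:
print(done)  # {'worker-a': ['job-0', 'job-2'], 'worker-b': ['job-1', 'job-3']}

# If worker-a goes down, new jobs are simply pulled by worker-b alone:
jobs = deque(["job-4", "job-5"])
while jobs:
    done["worker-b"].append(jobs.popleft())
print(done["worker-b"])  # ['job-1', 'job-3', 'job-4', 'job-5']
```

The point of the sketch is that "self-healing" here means the queue keeps draining through whichever workers are up, not that a standby worker is promoted.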
I have checked and we don’t currently check the worker version against the main instance version which allows the updates to be performed as you need to but this has a downside where if a new node was released and a worker was missed you will hit an issue.
We are now talking about ways to possibly solve this in the future, but for now I guess the best advice I have is: when you do your updates, make sure all of the workers are updated at the same time.
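The version guard discussed above does not exist in n8n today; purely as a hypothetical, such a check could look like comparing the worker's semver against the main instance's before the worker starts pulling jobs:

```python
# Hypothetical sketch only -- n8n does NOT currently perform this check.
def parse(version: str) -> tuple[int, ...]:
    """Split a dotted version string like '1.22.5' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

def worker_may_start(worker_version: str, main_version: str) -> bool:
    # Refuse to pull jobs when the worker lags behind the main instance,
    # so a missed update can't silently run workflows with missing nodes.
    return parse(worker_version) >= parse(main_version)

print(worker_may_start("1.22.5", "1.22.5"))  # True
print(worker_may_start("1.17.0", "1.22.5"))  # False -- worker left behind
```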
Loud and clear.
Are there any known issues with error workflows not triggering in queue mode in this setup? This is the only issue I still have. I assumed there might be an issue due to old versions of the workflows in the Postgres database, so I restarted everything and made new versions of the error workflows, without luck.
Well, error workflow execution has also stopped working on my non-queue-mode instance, which is in no way connected to the other setup.
I am not aware of any issues with error workflows not triggering at the moment. I take it the workflows are configured to call your error workflow and the workflow is actually erroring?
No, it's not triggering any error workflows I configured in the settings of a workflow. The logs from the instances are also not pointing me in any direction. Error workflows triggered correctly until a certain date, then stopped across the whole setup.
My issue seems to be similar to: Error Workflow not triggering in a multi container set up
That is a really old one. Let me check our internal instance, which is running in queue mode, to see if I get the same issue.
Amazing, in the meantime I will spin up another instance to see whether it may have been caused by the disconnect that happened with one of the workers.
I set up a clean instance (postgres, redis, worker, organiser) and I'm having the exact same issue.
I have got round to testing on a normal n8n instance and it appears to be OK. I'm going to test with our internal instance in the morning to see what happens.
Oddly we did have another report of the error trigger not working but that was on a single instance so I am not fully sure what is causing it or why I can’t reproduce it at the moment. I suspect there is some small bit of information I have missed that is important.
I appreciate the attention to the issue. You should be able to reproduce it by spawning an instance on Railway from the template "N8N". I tried various error setups that used to trigger, as well as simpler ones.
This should be fixed from 1.24.0. Can you try a newer release and let me know if the issue is resolved?
Going to try it out, will keep you posted @Jon
Working smoothly, amazing!
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.