Server crashing after 1 or 2 days running

I can get n8n to run with a basic webhook trigger and a few steps, but after leaving it online for a few days the server becomes unresponsive and I have to reboot it to get it working again.

I’ve noticed that when I run the top command in Linux to see the running processes, a large number of ‘node’ processes end up running. This might be related to my problem.

How do we troubleshoot this to identify the cause? Server goes down after 1 or 2 days of being online. It seems like some sort of memory leak or processes not being terminated correctly.

Thanks!
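One way to put numbers on the "large number of ‘node’ processes" observation is to log the process count and resident memory over time. The following is only a minimal sketch: it assumes it is run on the Linux host (not inside a minimal container image), that a POSIX-style `ps` is available, and that the n8n workers show up under the command name `node`. The function names are made up for illustration.

```python
import subprocess
from collections import defaultdict


def summarize_processes(ps_output: str) -> dict[str, tuple[int, int]]:
    """Return {command: (process_count, total_rss_kib)} from `ps -e -o rss= -o comm=` output."""
    totals = defaultdict(lambda: [0, 0])
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kib, command = int(parts[0]), parts[1].strip()
        totals[command][0] += 1
        totals[command][1] += rss_kib
    return {cmd: (count, rss) for cmd, (count, rss) in totals.items()}


def snapshot() -> dict[str, tuple[int, int]]:
    """Take one snapshot of all processes on this machine."""
    result = subprocess.run(
        ["ps", "-e", "-o", "rss=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return summarize_processes(result.stdout)
```

Calling `snapshot().get("node", (0, 0))` every few minutes (for example from cron) and appending the result to a file would show whether the number of `node` processes or their total memory climbs steadily before the server locks up.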

If you look at the executions, does it show any that keep on running forever?

Yes, it eventually does, just before the server becomes unresponsive.

Then after we restart the server, it shows those as ‘error’ with a long runtime. (I’m guessing the long runtime is the time from the task start to the server reset.)

Thanks again!!!

How do you start workflows? Do you maybe start some of them via the Cron node, or really all of them via Webhook?

We only have one active workflow and it has only one webhook with NO cron/scheduler.

Do you run workflow executions in the main process, or in a separate one?
We are not aware of any memory leak or similar. If there were one, we should have seen many more issues like yours. Normally n8n only crashes when it runs out of memory. Is it possible that this is the case? Could there, for some reason, be a huge spike in requests within a short time frame every 1-2 days, so that n8n starts too many workflows and then crashes?
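To check the request-spike theory, the execution start times can be bucketed per minute to see whether any minute stands out. This is just a sketch under assumptions: it takes the start times as ISO 8601 strings, and how they are exported from n8n (or from a reverse-proxy access log) is left open. `busiest_minutes` is a hypothetical helper name and the timestamps in the usage lines are invented.

```python
from collections import Counter
from datetime import datetime


def busiest_minutes(start_times: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Return the `top` minutes with the most execution starts, busiest first."""
    per_minute = Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:%M") for ts in start_times
    )
    return per_minute.most_common(top)


# Usage with invented timestamps: three starts fall into 10:00, one into 10:01.
example = [
    "2020-05-01T10:00:05",
    "2020-05-01T10:00:40",
    "2020-05-01T10:01:10",
    "2020-05-01T10:00:59",
]
print(busiest_minutes(example, top=2))
```

If the busiest minutes show only a handful of starts, a burst of requests is unlikely to be what exhausts the memory.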

Do you run workflow executions in the main process, or in a separate one?

I am not aware of the ‘main process’, but it’s just a basic n8n Docker setup, and my workflow design is just a webhook followed by some Google Sheets steps, etc.

When we ‘opened up the floodgates’ to allow many more rapid requests to the webhook, it did seem to crash faster, within a few hours rather than within a day or two.

There doesn’t seem to be a major influx of these at the time of error, just a few per minute. Here is one block of errors with long run-times:

[Screenshot: executions list showing a block of errors with long run-times]

You can find information about the “main process execution” here:

If you set that environment variable accordingly, it will greatly improve performance. n8n will then be able to respond much faster to requests, require fewer resources, and so be able to handle more.
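The thread does not name the variable. As an assumption, the sketch below uses `EXECUTIONS_PROCESS=main`, the setting older n8n releases used to run executions in the main process instead of a separate process per execution; check the documentation for your version before relying on it. The helper `docker_run_args` is hypothetical and only assembles the arguments for a basic Docker setup, so the variable is passed with `-e` when the container is started.

```python
def docker_run_args(
    env: dict[str, str], image: str = "n8nio/n8n", port: int = 5678
) -> list[str]:
    """Build a `docker run` argument list with the given environment variables."""
    args = ["docker", "run", "-d", "--name", "n8n", "-p", f"{port}:{port}"]
    for key, value in env.items():
        # Each variable becomes its own `-e KEY=VALUE` pair.
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


# Assumed variable name and value; not confirmed by the thread.
print(" ".join(docker_run_args({"EXECUTIONS_PROCESS": "main"})))
```

The printed command can be pasted into a shell, or the list can be handed to `subprocess.run` unchanged.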

Anyway, looking at your screenshot, it would probably still not totally fix your problem. Do you know why those jobs run so long? Is there any kind of loop in them, or anything else that could cause the jobs not to stop?

Awesome, thanks Jan, will test this out and let you know how it goes.
