Stress / Load testing

I’m doing a few stress / load tests to check whether we can put n8n in production in one of our projects. While doing so, I’m stumbling on a few details.

I created this post to note down and share the things I think are worth mentioning. I will post more details as I go and wrap it up once my tests are done, so this is a WIP. Afterwards I may open some bug reports or feature requests to handle some of the issues.

Loading Executions

When you have a lot of executions in the history (more than 3 million), the list takes ages to load (around 10 min). I’m using Postgres (locally), because it has better latency for executions than SQLite.
UPDATE: with auto-refresh turned off it’s faster, more like 50 secs ~ 1 min 30 secs to load…

Workflow ID as int

Having the workflow id as an int causes a lot of trouble. Every time I export and import a workflow that has “Execute Workflow” nodes, I have to find out the new id of each sub-workflow and fix the references by hand. It’s a nightmare. When you export all workflows, the “Execute Workflow” nodes point to the ids of the sub-workflows they need. When you import them into another server, each workflow seems to get the next available id. Even if the target server is empty, if the original set skipped an id (because a workflow had been deleted), the new set will reuse that number, so the ids end up different.
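
As a stopgap, I’m considering scripting the fix instead of patching the ids by hand. A rough sketch of what I have in mind (the node type “n8n-nodes-base.executeWorkflow” and the parameters.workflowId field are assumptions based on my own exports, so double-check them against yours; the old-to-new id map has to be prepared manually, e.g. by matching workflow names after the import):

```typescript
// remap-subworkflow-ids.ts
// Rewrite the workflowId referenced by "Execute Workflow" nodes in exported
// workflow JSON files, using a manually prepared old-id -> new-id map.
import { readdirSync, readFileSync, writeFileSync } from "fs";
import { join } from "path";

const exportDir = process.argv[2] ?? "./workflows";  // directory with the exported *.json files
const mapFile = process.argv[3] ?? "./id-map.json";  // JSON map like { "12": "47", "15": "48" }
const idMap: Record<string, string> = JSON.parse(readFileSync(mapFile, "utf8"));

for (const file of readdirSync(exportDir).filter((f) => f.endsWith(".json"))) {
  const path = join(exportDir, file);
  const workflow = JSON.parse(readFileSync(path, "utf8"));
  let changed = false;

  for (const node of workflow.nodes ?? []) {
    // Node type and parameter name are assumptions; verify against your export.
    if (node.type !== "n8n-nodes-base.executeWorkflow" || !node.parameters) continue;
    const oldId = String(node.parameters.workflowId ?? "");
    if (idMap[oldId] !== undefined) {
      node.parameters.workflowId = idMap[oldId]; // point at the id on the new server
      changed = true;
    }
  }

  if (changed) writeFileSync(path, JSON.stringify(workflow, null, 2));
}
```

In practice I’d import everything once, build the old-to-new id map by matching workflow names between the two servers, patch the exported files with something like this, and then re-import them (or update the workflows through the API).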

Memory Leaks

There may be some memory leak in n8n. Using a workflow based on real-life solutions, with more than 60 nodes and a few sub-workflows, the memory keeps increasing until Node crashes with an out-of-memory error (FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory). I disabled all custom nodes to check whether it was my fault, but that doesn’t change the behavior. When I stop the load tester while memory usage is high but the process hasn’t crashed yet, and leave it idle for a while, the memory usage stays the same, so it’s unlikely to just be the GC struggling to keep up.

Does anyone have tips on how to spot the memory leak :pleading_face:? Like tools that may help? My first bet would be on some node like Redis or MongoDB leaving connections open.
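
For reference, this is the kind of generic Node.js instrumentation I was planning to try first (a minimal sketch, not n8n-specific; the interval and the growth threshold are arbitrary values I picked):

```typescript
// heap-watch.ts
// Log process memory every minute and write a V8 heap snapshot whenever the
// heap has grown by ~200 MB since the last snapshot. Snapshots can be loaded
// and diffed in Chrome DevTools (Memory tab) to see which objects accumulate.
import { writeHeapSnapshot } from "v8";

const GROWTH_THRESHOLD = 200 * 1024 * 1024; // heap growth (bytes) between snapshots
let lastSnapshotHeap = 0;

setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log(
    `rss=${Math.round(rss / 1e6)}MB ` +
      `heapUsed=${Math.round(heapUsed / 1e6)}MB ` +
      `external=${Math.round(external / 1e6)}MB`
  );

  if (heapUsed - lastSnapshotHeap > GROWTH_THRESHOLD) {
    lastSnapshotHeap = heapUsed;
    const file = writeHeapSnapshot(); // written to the current working directory
    console.log(`wrote heap snapshot: ${file}`);
  }
}, 60_000);
```

The idea would be to preload the compiled file into the worker process with node --require, or to start it with node --inspect and take heap snapshots from Chrome DevTools instead; diffing two snapshots taken a while apart should show whether it really is Redis/MongoDB connections piling up or something else.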

UPDATE: When using queue mode, that is, one main process (with N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true), one process for webhooks, and a few worker processes, only the worker processes have this problem. Not much, but it narrows the search surface a little. By the way, all the workers crashed within a short time during the test (6 in a 30-minute run).


About the memory leak: does anyone have any tips on how to find it?

About the memory leak: I’m not sure I understand.
Are you saying that n8n executes the workflow fine a few times and then it starts crashing? Because then it sounds like a memory leak.
Or does the workflow simply always crash? Because then it is probably expected behavior. If n8n takes up all the memory and it never gets freed up (because the workflow never finishes), the only thing it can do is crash.

I’m doing a stress / load test, so I’m making as many requests as possible to a webhook.

When a worker starts, it responds to the requests fine, but as it keeps responding to requests, its memory usage increases. After a while (20 min or 2 h, for example), the memory usage is so high that the Node process crashes with an out-of-memory error.

The possibility that some requests are still running is interesting, as I have a lot of problems with workflows that hang at some point, for example waiting for a RabbitMQ server that has crashed, and keep running until the workflow’s max time (30 min, I believe). I will look into that, but in the last tests I did, I believe this didn’t happen and it’s something else. How to look for it, that is the hard part.


About the Loading Executions, I opened a specific issue here:

Hi @AllanDaemon

Thank you for these tests. I will be checking the issues you reported with the executions list and maybe have a look at possible garbage collection issues.

The easiest way to diagnose this without many changes would be to simply check the executions list.

Considering it is taking too long to load, you can check the database directly, in the execution_entity table. Search for rows where finished = false or finishedAt is null (see the query sketch below).

  • finished = false means the workflow ended with an error
  • finishedAt is null means the workflow is either still running, or it crashed and did not report back to n8n’s main process, so we’re unsure of its state.
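
If you prefer not to run the SQL by hand, something like this would do it (just a sketch using the pg client; the connection string is a placeholder and the column names follow the description above, so double-check them against the schema of your n8n version):

```typescript
// check-stuck-executions.ts
// Query execution_entity directly for executions that ended with an error
// (finished = false) or never reported back (finishedAt is null).
import { Client } from "pg";

async function main() {
  // Placeholder connection string; point it at the database n8n is using.
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const { rows } = await client.query(`
    SELECT id, finished, "finishedAt"
    FROM execution_entity
    WHERE finished = false OR "finishedAt" IS NULL
    LIMIT 50;
  `);

  console.table(rows);
  await client.end();
}

main().catch(console.error);
```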

I will be investigating this issue and will get back to you once I find something.


Thanks. I did sometimes load them directly from the database.

As for the memory leak, I opened 2 issues with more details and also made 1 PR to fix one of them.


