Stress / Load testing

I’m doing a few stress / load tests to check whether we can put n8n in production in one of our projects. While doing this, I’m stumbling on a few details.

I created this post to note and share the things I think are worth mentioning. I will post more details as I go and try to wrap it up after I finish my tests, so this is a WIP. I may open some bug reports or feature requests afterwards to handle some of the issues.

Loading Executions

When you have a lot of executions in history (more than 3 million), the executions list takes ages to load (around 10 min). I’m using Postgres (locally) because it has better latency for executions than SQLite.
UPDATE: with auto-refresh turned off it’s faster, around 50 sec ~ 1 min 30 sec to load…

Workflow ID as int

Having the workflow id as an int causes a lot of trouble. Every time I export and import a workflow that has “Execute Workflow” nodes, I have to find out the new id of each sub-workflow and fix the references by hand. It’s a nightmare. When you export all workflows, the “Execute Workflow” nodes point to the ids of the sub-workflows they need. When you import them into another server, each workflow seems to take the next available id. Even if the target server is empty, if the original set skipped a number because a workflow had been deleted, the new set will reuse that number, so the ids end up different.
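A workaround I have been sketching (not an n8n feature, just a helper script): before importing, rewrite the sub-workflow references in the exported JSON using an old-id → new-id map. The node type name (n8n-nodes-base.executeWorkflow) and the parameters.workflowId field are what my exports contain; treat them as assumptions that may differ in other versions, and the ids in the map below are made up.

```typescript
// remap-subworkflow-ids.ts
// Sketch: rewrite "Execute Workflow" references in exported workflow JSON
// with a manually built map of source-server id -> target-server id.
// Assumes each export has a `nodes` array and that Execute Workflow nodes
// keep the target id in `parameters.workflowId` (check your n8n version).
import { readFileSync, writeFileSync } from "fs";

const idMap: Record<string, string> = {
  // old id on the source server -> new id on the target server (examples)
  "12": "7",
  "15": "8",
};

for (const file of process.argv.slice(2)) {
  const workflow = JSON.parse(readFileSync(file, "utf8"));

  for (const node of workflow.nodes ?? []) {
    if (node.type !== "n8n-nodes-base.executeWorkflow") continue;

    const oldId = String(node.parameters?.workflowId ?? "");
    if (idMap[oldId]) {
      node.parameters.workflowId = idMap[oldId];
      console.log(`${file}: ${node.name}: ${oldId} -> ${idMap[oldId]}`);
    }
  }

  writeFileSync(file, JSON.stringify(workflow, null, 2));
}
```

Running it over the exported files (e.g. `npx ts-node remap-subworkflow-ids.ts exports/*.json`) before importing keeps the “Execute Workflow” nodes pointing at the right sub-workflows, but you still have to build the id map by hand.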

Memory Leaks

There may be some memory leak in n8n. Using a workflow based on a real-life solution, with more than 60 nodes and a few sub-workflows, memory keeps increasing until Node crashes with ‘out of memory’ (FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory). I disabled all custom nodes to check whether it was my fault, but that doesn’t change the behavior. When I stop the load tester while memory usage is high but n8n hasn’t crashed yet and leave it idle for a while, memory usage stays the same, so it’s unlikely to just be the GC struggling.

Does anyone have tips on how to spot the memory leak :pleading_face:? Like tools that may help? My first bet would be on some node like Redis or MongoDB leaving connections open.

UPDATE: When using queue mode, that is one main process (with N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true), one process for webhooks, and a few processes for the workers, only the worker processes have this problem. Not much, but it narrows the search surface a little. By the way, all the workers crashed within a short time in the test (6 in a 30 min run).
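In case anyone wants to reproduce this, one way I’m planning to narrow it down is to preload a small monitor into the worker with NODE_OPTIONS and diff heap snapshots in Chrome DevTools. This is plain Node tooling, nothing n8n-specific, and the interval, threshold and wiring below are arbitrary choices for the sketch.

```typescript
// heap-monitor.ts
// Sketch: compile to JS, then preload into the worker with something like
//   NODE_OPTIONS="--require /path/to/heap-monitor.js" n8n worker
// Logs memory usage periodically and writes a heap snapshot once the heap
// crosses an (arbitrary) threshold; diff it in Chrome DevTools (Memory tab)
// against a snapshot taken right after startup.
import { writeHeapSnapshot } from "v8";

const LIMIT_MB = 1024; // arbitrary threshold for this sketch
let snapshotTaken = false;

setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  const heapMb = Math.round(heapUsed / 1024 / 1024);
  console.log(`[heap-monitor] rss=${Math.round(rss / 1024 / 1024)}MB heap=${heapMb}MB`);

  if (heapMb > LIMIT_MB && !snapshotTaken) {
    snapshotTaken = true;
    const file = writeHeapSnapshot(); // written to the current working directory
    console.log(`[heap-monitor] wrote ${file}`);
  }
}, 30_000).unref(); // don't keep the process alive just for the monitor
```

If anyone knows a better approach, or what to look for in the snapshot diff, I’m all ears.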

5 Likes

About the Memory Leak thing, does anyone have any tips to find it?

About the memory leak, I’m not sure I understand.
Are you saying that n8n executes the workflow a few times fine and then it starts crashing? Because then it sounds like a memory leak.
Or does the workflow simply always crash? Because then it is probably expected behavior. If n8n takes up all the memory and it never gets freed (because the workflow never finishes), the only thing it can do is crash.

I’m doing a stress / load test, so I’m making as many requests as possible to a webhook.

When the worker starts, it responds to the requests fine, but as it keeps responding to requests, its memory usage increases. After a while (20 min or 2 h, for example), memory usage is so high that the Node process crashes with an out-of-memory error.

The possibility that requests are still running is interesting, as I have a lot of problems with workflows that hang at some point, for example waiting on a RabbitMQ server that has crashed, and keep running until the workflow’s max time (30 min, I believe). I will look into that, but in the last tests I did, I believe this didn’t happen, and it’s something else. How to look for it is the hard part.

1 Like

About the Loading Executions, I opened a specific issue here:

https://github.com/n8n-io/n8n/issues/1578

Hi @AllanDaemon

Thank you for these tests. I will be checking the issues you reported with the executions list and maybe have a look at possible garbage collection issues.

The easiest way to diagnose this without many changes would be to simply check the executions list.

Since the list is taking too long to load, you can check the database directly, in the execution_entity table. Search for rows where finished = false or finishedAt is null (a quick query sketch follows the list below).

  • finished = false means the workflow ended with an error
  • finishedAt is null means the workflow is still running or crashed and did not notify back n8n’s main process, so we’re unsure of its state.
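For example, something along these lines (just a sketch from Node against Postgres; the column names simply mirror what I described above, and the connection string, identifier quoting and limit are assumptions you may need to adjust for your version):

```typescript
// check-stuck-executions.ts
// Sketch: list executions that errored or never reported back to n8n.
// Assumes Postgres and the default `execution_entity` table; adjust the
// connection string and identifier quoting to match your setup.
import { Client } from "pg";

async function main() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const { rows } = await client.query(
    `SELECT id, "workflowId", "startedAt", finished, "finishedAt"
       FROM execution_entity
      WHERE finished = false OR "finishedAt" IS NULL
      ORDER BY "startedAt" DESC
      LIMIT 100`
  );

  console.table(rows); // errored runs vs. runs that are still going / crashed
  await client.end();
}

main().catch(console.error);
```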

I will be investigating this issue and will return to you once I find something.

2 Likes

Thanks. I did sometimes query it directly from the database.

As for the memory leak, I opened 2 issues with more details and also did 1 PR to fix one of them.

2 Likes

This isn’t that much, but I managed to use up 8 GB of RAM and a 24 GB swap file doing a 4000-file sync job.

It’s not going to happen again, but it would be nice to see some kind of memory-limiting feature that queues other jobs until the pending ones have completed.

1 Like

Interesting read. I am currently doing the same evaluation and hoped that it would perform better than node-red. This is my original question: Known limitations in scaling the number of workflows per instance?

2 Likes

I had memory leaks when working with lots of:

  • Files.
  • HTTP requests.

I have successfully tested the following alternatives to fix memory leaks:

  • Instead of saving files in memory, save them in the filesystem with the variable:
    N8N_DEFAULT_BINARY_DATA_MODE=filesystem
  • If you send lots of HTTP requests, this will consume memory/database space in n8n. Replace HTTP Request nodes with curl (install it with apk) via the Execute Command node. This avoids saving the status of every HTTP request in the database for each execution.
  • Purge the database history (a rough purge sketch follows this list).
  • Scale your n8n following this guide: Overview - n8n Documentation
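For the history purge, something along these lines works as a rough sketch (the 14-day cutoff, the DATABASE_URL connection string and the column names are my assumptions, and n8n’s own execution pruning settings are the safer option if they cover your needs):

```typescript
// purge-execution-history.ts
// Rough sketch: delete old finished executions straight from Postgres.
// Cutoff, connection string and column names are assumptions; prefer
// n8n's built-in execution pruning if it fits your use case.
import { Client } from "pg";

async function purge() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const result = await client.query(
    `DELETE FROM execution_entity
      WHERE finished = true
        AND "startedAt" < now() - interval '14 days'`
  );

  console.log(`Deleted ${result.rowCount} old executions`);
  await client.end();
}

purge().catch(console.error);
```
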
2 Likes

Hey @Miquel_Colomer,

Can you share any more information on the memory leaks you had with the HTTP Request node? I am not aware of any so I would like to dig into that a bit more.

Hi @Jon ,

I remember that I sent a huge number of HTTP requests (more than 1k, 10 in parallel), and my n8n went down.

I decided to move http requests to curl to avoid the problem.

Additionally, I decided to migrate part of the flow to Lambda to improve performance.

Hey @Miquel_Colomer,

Any chance the HTTP Request issue was not down to a memory leak but rather an under-resourced server or the scaling mode not being used?

Yes. But I’m not 100% sure (it was a long time ago).

I didn’t use queue mode, to avoid having to set up extra infrastructure myself.

I know that a single n8n has its limits :wink:

1 Like

You are not wrong there, a single n8n will have its limits, but it all comes down to the resources available. n8n is fairly heavy on memory usage, and if you use own mode it will use more than main but won’t be as resilient.

1 Like

Absolutely.

In my case, resources were not a problem (64 GB of RAM).

But as you said, n8n consumes resources. Perhaps it would be nice to stress-test it to understand its limits and detect possible leaks (I think somebody did that).

1 Like