Stress / Load testing

I’m doing a few stress / load tests to check whether we can put n8n in production in one of our projects. While doing so, I’m stumbling on a few details.

I created this post to note down and share the things I think are worth mentioning. I will post more details as I go and wrap it up once my tests are done, so this is a WIP. Afterwards I may open some bug reports or feature requests to handle some of the issues.

Loading Executions

When you have a lot of executions in the history (more than 3 million), the list takes ages to load (around 10 min). I’m using Postgres (locally), because it has better latency for executions than SQLite.
UPDATE: with auto-refresh turned off it’s faster, more like 50 secs ~ 1 min 30 secs to load…

Workflow ID as int

Having the workflow id as an int causes a lot of trouble. Every time I export and import a workflow that has “Execute Workflow” nodes, I have to find out the new id of each sub-workflow and fix the references by hand. It’s a nightmare. When you export all workflows, the “Execute Workflow” nodes point to the ids of the sub-workflows they need. When you import them into another server, each workflow seems to get the next available id. Even if the target server is empty, if the original set skipped an id (because a workflow had been deleted), the new set will reuse that number, so the ids end up different.
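
As a stopgap, I’m considering scripting the fix instead of patching the ids by hand. A rough sketch of what I have in mind (the node type “n8n-nodes-base.executeWorkflow” and the parameters.workflowId field are assumptions based on my own exports, so double-check them against yours; the old-to-new id map has to be prepared manually, e.g. by matching workflow names after the import):

```typescript
// remap-subworkflow-ids.ts
// Rewrite the workflowId referenced by "Execute Workflow" nodes in exported
// workflow JSON files, using a manually prepared old-id -> new-id map.
import { readdirSync, readFileSync, writeFileSync } from "fs";
import { join } from "path";

const exportDir = process.argv[2] ?? "./workflows";  // directory with the exported *.json files
const mapFile = process.argv[3] ?? "./id-map.json";  // JSON map like { "12": "47", "15": "48" }
const idMap: Record<string, string> = JSON.parse(readFileSync(mapFile, "utf8"));

for (const file of readdirSync(exportDir).filter((f) => f.endsWith(".json"))) {
  const path = join(exportDir, file);
  const workflow = JSON.parse(readFileSync(path, "utf8"));
  let changed = false;

  for (const node of workflow.nodes ?? []) {
    // Node type and parameter name are assumptions; verify against your export.
    if (node.type !== "n8n-nodes-base.executeWorkflow" || !node.parameters) continue;
    const oldId = String(node.parameters.workflowId ?? "");
    if (idMap[oldId] !== undefined) {
      node.parameters.workflowId = idMap[oldId]; // point at the id on the new server
      changed = true;
    }
  }

  if (changed) writeFileSync(path, JSON.stringify(workflow, null, 2));
}
```

In practice I’d import everything once, build the old-to-new id map by matching workflow names between the two servers, patch the exported files with something like this, and then re-import them (or update the workflows through the API).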

Memory Leaks

There may be some memory leak in n8n. Using a workflow based on real-life solutions, with more than 60 nodes and a few sub-workflows, the memory keeps increasing until Node crashes with an out-of-memory error (FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory). I disabled all custom nodes to check whether it was my fault, but that doesn’t change the behavior. When I stop the load tester while memory usage is high but the process hasn’t crashed yet, and leave it idle for a while, the memory usage stays the same, so it’s unlikely to just be the GC struggling to keep up.

Does anyone have tips on how to spot the memory leak :pleading_face:? Like tools that may help? My first bet would be on some node like Redis or MongoDB leaving connections open.
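
For reference, this is the kind of generic Node.js instrumentation I was planning to try first (a minimal sketch, not n8n-specific; the interval and the growth threshold are arbitrary values I picked):

```typescript
// heap-watch.ts
// Log process memory every minute and write a V8 heap snapshot whenever the
// heap has grown by ~200 MB since the last snapshot. Snapshots can be loaded
// and diffed in Chrome DevTools (Memory tab) to see which objects accumulate.
import { writeHeapSnapshot } from "v8";

const GROWTH_THRESHOLD = 200 * 1024 * 1024; // heap growth (bytes) between snapshots
let lastSnapshotHeap = 0;

setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log(
    `rss=${Math.round(rss / 1e6)}MB ` +
      `heapUsed=${Math.round(heapUsed / 1e6)}MB ` +
      `external=${Math.round(external / 1e6)}MB`
  );

  if (heapUsed - lastSnapshotHeap > GROWTH_THRESHOLD) {
    lastSnapshotHeap = heapUsed;
    const file = writeHeapSnapshot(); // written to the current working directory
    console.log(`wrote heap snapshot: ${file}`);
  }
}, 60_000);
```

The idea would be to preload the compiled file into the worker process with node --require, or to start it with node --inspect and take heap snapshots from Chrome DevTools instead; diffing two snapshots taken a while apart should show whether it really is Redis/MongoDB connections piling up or something else.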

UPDATE: When using queue mode, that is, one main process (with N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true), one process for webhooks, and a few worker processes, only the worker processes have this problem. Not much, but it narrows the search surface a little. By the way, all the workers crashed within a short time during the test (6 in a 30-minute run).


About the memory leak: does anyone have any tips on how to find it?

About the memory leak: I’m not sure I understand.
Are you saying that n8n executes the workflow fine a few times and then it starts crashing? Because then it sounds like a memory leak.
Or does the workflow simply always crash? Because then it is probably expected behavior. If n8n takes up all the memory and it never gets freed up (because the workflow never finishes), the only thing it can do is crash.

I’m doing a stress / load test, so I’m making as many requests as possible to a webhook.

When a worker starts, it responds to the requests fine, but as it keeps responding to requests, its memory usage increases. After a while (20 min or 2 h, for example), the memory usage is so high that the Node process crashes with an out-of-memory error.

The possibility that some requests are still running is interesting, as I have a lot of problems with workflows that hang at some point, for example waiting for a RabbitMQ server that has crashed, and keep running until the workflow’s max time (30 min, I believe). I will look into that, but in the last tests I did, I believe this didn’t happen and it’s something else. How to look for it, that is the hard part.


About the Loading Executions, I opened a specific issue here:

Hi @AllanDaemon

Thank you for these tests. I will be checking the issues you reported with the executions list and maybe have a look at possible garbage collection issues.

The easiest way to diagnose this without many changes would be to simply check the executions list.

Considering it is taking too long to load, you can check the database directly, in the execution_entity table. Search for rows where finished = false or finishedAt is null (see the query sketch below).

  • finished = false means the workflow ended with an error
  • finishedAt is null means the workflow is either still running, or it crashed and did not report back to n8n’s main process, so we’re unsure of its state.
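
If you prefer not to run the SQL by hand, something like this would do it (just a sketch using the pg client; the connection string is a placeholder and the column names follow the description above, so double-check them against the schema of your n8n version):

```typescript
// check-stuck-executions.ts
// Query execution_entity directly for executions that ended with an error
// (finished = false) or never reported back (finishedAt is null).
import { Client } from "pg";

async function main() {
  // Placeholder connection string; point it at the database n8n is using.
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const { rows } = await client.query(`
    SELECT id, finished, "finishedAt"
    FROM execution_entity
    WHERE finished = false OR "finishedAt" IS NULL
    LIMIT 50;
  `);

  console.table(rows);
  await client.end();
}

main().catch(console.error);
```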

I will be investigating this issue and will get back to you once I find something.


Thanks. I did sometimes load them directly from the database.

As for the memory leak, I opened 2 issues with more details and also made 1 PR to fix one of them.


