Are there plans to make n8n work with big datasets?

Situation

First I want to clarify that I’m not trying to bad-mouth anything. From what I have seen so far, n8n is an amazingly powerful tool for automation.

I’m currently evaluating whether n8n is something we could use for our integrations, but I keep stumbling upon issues when combining n8n with big datasets (> 200,000 entries). Please do correct me if any assumption I make here is wrong and could have been fixed by a specific setting I missed.

Node / TypeScript

I am all for browser applications, but I strongly doubt that processing larger datasets with, for example, the Merge node in Node.js or the client browser is a good idea. As soon as you work with larger datasets and try to merge them, you will obviously run into issues, as you are limited by the single-threaded nature of it all.

This also means execution will always be extremely slow as soon as n8n needs to execute something itself (Code, Merge, …), or the browser straight up crashes under the load while building the workflow. It already strikes me as suspect that the browser application is not lightweight at all and actually seems to execute code and hold the workflow’s data at all times, which entails the need for cloud or on-premise services to handle the heavy lifting for n8n.

Also, looking at the Merge node again, we can see that it potentially goes very, very deep into the stack when working on bigger data, which makes it necessary to increase the stack size in Node; otherwise we run into stack-size errors.

Consequences

  • Workflows are unstable and have a very high chance of running indefinitely; trying to stop the workflow does not work.
  • There is a big chance the site crashes when you try to interact with it while it is executing code or merging.
  • Extremely long execution times
  • Node stack-size issues
  • Ghost workflows that cannot be deleted anymore and do not resolve themselves.

Conclusion

To bring the post to a question: are there any active plans to improve n8n for scale, or will n8n remain a tool for smaller workloads only?

As of now it seems like a fundamental technical limitation that could only be removed with a major overhaul of the system and its inner workings, and it would be understandable if that is not a current focus at all.

Information on your n8n setup

  • n8n version: 0.227.1
  • Database (default: SQLite): SQLite
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system: linux

Hey @SvenSylphen,

That is a good question. As it stands, using a lot of data in one workflow can cause issues, but these can be worked around by handling the data in smaller chunks using sub-workflows. Trying to load a lot of data in the editor is not a great idea, though, as it is all loaded into the browser, and each tab only has a limited amount of memory it can use before the browser kills the tab. When a workflow runs on a schedule, it works a lot better.
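
To make the "smaller chunks" idea a bit more concrete, here is a rough sketch of a Code node that splits the incoming items into chunks, which could then be handed to an Execute Workflow (sub-workflow) node. The chunk size and the field names are just assumptions for illustration, not an official pattern:

```javascript
// Rough sketch only: split incoming items into chunks so each chunk can be
// passed to a sub-workflow via the Execute Workflow node.
// Assumes a Code node running in "Run Once for All Items" mode.
const allItems = $input.all();
const chunkSize = 5000; // assumed value, tune to your memory limits
const chunks = [];

for (let i = 0; i < allItems.length; i += chunkSize) {
  // Wrap each chunk in a single item so the sub-workflow gets one batch per call
  chunks.push({ json: { rows: allItems.slice(i, i + chunkSize).map((item) => item.json) } });
}

return chunks;
```

Each output item can then drive one sub-workflow call, so the heavy processing happens on one chunk at a time.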

We are always looking into ways to improve performance so we can handle larger datasets. Our most recent tweak was using streams for files rather than keeping them all in memory, which won’t help when working with data from an API or a database, but I would imagine we will improve that side of things in the future.

I am sure @jan or @sirdavidoff might have some thoughts on this as well.


@sirdavidoff @jan Hey, any more input on this?

I have by now circumvented most of the browser-related issues by adding conditional JavaScript to the SQL statements that limits the queries when they are executed via the browser.
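
For anyone who wants to do the same, this is roughly the pattern I mean (table and column names are placeholders, and I am assuming $execution.mode reports 'test' for manual runs from the editor):

```javascript
// Rough sketch of the conditional limiting mentioned above. Table/column names
// are placeholders; $execution.mode is assumed to be 'test' when the workflow
// is run manually from the editor. The resulting string feeds the database node.
const baseQuery = 'SELECT id, name, status FROM some_table';
const query = $execution.mode === 'test'
  ? `${baseQuery} LIMIT 1000` // keep manual/editor runs small so the browser survives
  : baseQuery;                // scheduled runs fetch the full dataset

return [{ json: { query } }];
```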

But while intensive tasks do in fact run better on a schedule, when an execution fails I am most often unable to see the error message and have to debug by guesswork, as I cannot seem to open the workflow from the execution list to see the error. Apparently, when running into memory issues, the error workflow does not get triggered either.

This is one of the problem children.
It transmits around 200,000 entries, each with about 8 columns.
A recommendation was sub-workflows, but this is not the only workflow like this. Does that mean I am supposed to create an additional workflow for every workflow that has high item counts? That seems like a huge overhead of workflows.

I have also tried using “Split In Batches”, but it does not change anything at all.
Does “Split In Batches” not free the memory of the batches that have been executed already?

Any input would be much appreciated.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.