Full Data Pipelines with raw data loading and ETL

Hello all,
I'm just looking for feedback on my idea, as I don't have anyone else with knowledge of n8n and data engineering to discuss it with.

At my company we currently collect sports data from different APIs, but we don't have any kind of DWH or data lake yet. So I'm thinking about creating a medallion layer structure: loading data from various APIs into the first layer as raw data, then processing it via ETL into the next layers/databases.
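To make the idea concrete, here is a rough sketch of what I mean by loading into the raw layer; Postgres, the `pg` package, the endpoint, and all table/column names are just placeholder assumptions, not our actual setup:

```typescript
// Rough sketch of a bronze/raw-layer loader. Assumptions: Postgres as the
// raw store, the "pg" npm package, and a made-up sports API endpoint.
import { Client } from "pg";

async function loadRawMatches(): Promise<void> {
  const db = new Client({ connectionString: process.env.RAW_DB_URL });
  await db.connect();

  // Store the API response verbatim: the raw layer should keep the original
  // payload untouched so the later layers can always be rebuilt from it.
  const res = await fetch("https://api.example.com/v1/matches");
  const payload = await res.json();

  await db.query(
    `INSERT INTO raw_matches (source, loaded_at, payload)
     VALUES ($1, now(), $2)`,
    ["example-api", JSON.stringify(payload)]
  );
  await db.end();
}

loadRawMatches().catch((err) => {
  console.error(err);
  process.exit(1);
});
```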

Now I'm thinking of building all of these steps in n8n: a scheduled workflow that loads the raw data overnight and then kicks off the ETL workflow once it finishes. In a smaller environment I'm actually doing that already, just with small datasets.
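The chaining itself would probably be an Execute Workflow node inside n8n; as a sketch of the alternative, here is what calling the ETL workflow's Webhook trigger at the end of the loader could look like (the URL and payload are made up):

```typescript
// Sketch of chaining two workflows over a webhook, assuming the ETL workflow
// starts with a Webhook trigger; the path "start-etl" is hypothetical.
async function triggerEtl(): Promise<void> {
  const resp = await fetch("https://n8n.example.com/webhook/start-etl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Pass along what the loader just finished, so the ETL run knows its scope.
    body: JSON.stringify({ loadDate: new Date().toISOString().slice(0, 10) }),
  });
  if (!resp.ok) throw new Error(`ETL trigger failed: ${resp.status}`);
}
```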

The initial load would be fairly large, probably somewhere in the three-digit GB range; after that, daily volume should mostly stay below 100 MB, with peaks a bit above that on some days.

I was wondering if anybody has experience with this, and I also just wanted to write it down, as I'm still not sure whether I should commit to n8n for this or look at other tools that are more “common” in the industry.

Sorry for my bad English, and thanks for reading.

Hi @Schlech

Sounds like a nice project. :slight_smile:
I don't have experience with that much data, I think, but yes, it should be possible, as long as you cut the data up into smaller chunks and assign enough resources.
Just make sure to build smaller, reusable parts, and try to standardise the data on ingestion so you don't have to create every flavour of data flow from scratch. :slight_smile: But I think you already described something like that.
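Something like this is what I mean by standardising: one small adapter per source API, mapping into a single canonical shape as early as possible, so downstream flows only ever see one format. A rough sketch; both source shapes and all field names are invented:

```typescript
// One canonical record type that every downstream flow works against.
interface CanonicalMatch {
  source: string;
  matchId: string;
  startTime: string; // ISO 8601
  homeTeam: string;
  awayTeam: string;
}

// One small adapter per source API; only these know the provider's shape.
function fromProviderA(raw: { id: number; kickoff: string; home: string; away: string }): CanonicalMatch {
  return {
    source: "provider-a",
    matchId: String(raw.id),
    startTime: raw.kickoff,
    homeTeam: raw.home,
    awayTeam: raw.away,
  };
}

function fromProviderB(raw: { matchCode: string; start_ts: number; teams: [string, string] }): CanonicalMatch {
  return {
    source: "provider-b",
    matchId: raw.matchCode,
    startTime: new Date(raw.start_ts * 1000).toISOString(),
    homeTeam: raw.teams[0],
    awayTeam: raw.teams[1],
  };
}
```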

Hi @BramKn

Thanks for the feedback. I already ran into problems like the ones you describe: I made one workflow a bit too long, and n8n crashed from time to time.
I split it into smaller workflows exactly as you said, and it started to work, especially regarding the RAM n8n uses.
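For anyone finding this later: the "smaller bits" idea boils down to processing fixed-size batches so memory stays bounded instead of holding everything at once. In n8n that is what the Split in Batches node does; as plain code it is roughly this (the processing callback is a placeholder):

```typescript
// Sketch of bounded-memory processing: work through fixed-size slices of the
// input instead of materialising everything at once. processBatch is
// hypothetical and stands in for whatever each chunk needs done to it.
async function processInBatches<T>(
  items: T[],
  batchSize: number,
  processBatch: (batch: T[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    await processBatch(items.slice(i, i + batchSize));
  }
}
```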

I'm currently at the staging area for the first part of the incoming data, and it has worked fine so far. The job has currently been running for around 1 h and is estimated to take around 2.5 h until everything is forwarded to the staging DB.
While waiting, I was wondering: what would be the best way (if possible) to speed up the processing time of the nodes on the hardware side?
I installed n8n via npm on a dedicated server. If I put more RAM into the server, would that speed up processing, or are other components more important for that?

Hi @Schlech

n8n is single-threaded, so really the only way to speed it up is to add workers (each of which can use one thread) and split the load across them with a queue (something like RabbitMQ). Also make sure the concurrency isn't set too high for the workers.
With memory it's basically binary: you either have enough or you don't, I think. It might slow down if you're right at the limit, but other than that it shouldn't have a performance impact as far as I know.
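As a rough sketch of the worker setup (note that n8n's built-in queue mode actually uses Redis as the queue; the Redis host, worker count, and concurrency below are placeholders to adjust to your server):

```typescript
// Sketch: in queue mode the main n8n instance enqueues executions and
// separate worker processes pull them off the queue, one Node.js thread each.
// Assumes n8n is installed globally and Redis is reachable on localhost.
import { spawn } from "node:child_process";

const workerEnv = {
  ...process.env,
  EXECUTIONS_MODE: "queue",           // n8n setting: enable queue mode
  QUEUE_BULL_REDIS_HOST: "localhost", // n8n setting: Redis host for the queue
};

// Start two workers; each handles up to 5 concurrent executions.
for (let i = 0; i < 2; i++) {
  spawn("n8n", ["worker", "--concurrency=5"], { env: workerEnv, stdio: "inherit" });
}
```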

I always run n8n in Docker, as it's a lot easier to deal with than npm. In Docker you do need to set an extra env variable (`NODE_OPTIONS=--max-old-space-size=...`, if I remember correctly) to use more than 4 GB of RAM, but I don't think you need that with an npm install.

Feel free to send me a DM if you'd like to use my consulting services to optimise your workflows and setup.