How to work with large file sizes in n8n? (Downloading into n8n & then uploading to a cloud storage service)

Describe the problem/error/question

I am auto-generating videos at scale. I have 70 videos right now that are about 6-8 MB each, and my workflow worked perfectly when I ran it for one video. The problem is that n8n times out when I try to run the node for all 70 videos, so I’m trying to understand the size and memory limitations of the service and how to work with large files. This also worked when I uploaded 5 videos at a time. I just don’t know what I don’t know about working with large files, and I’d love any help and insights on how to stop this process from timing out.

What is the error message (if any)?

Timeout

Please share your workflow

Share the output returned by the last node

Information on your n8n setup

  • n8n version: 1.41
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app): Docker
  • Operating system:


Hi @buildingthings, I am very sorry you’re having trouble.

To avoid hitting any resource limits when processing a large number of files, I suggest splitting your workflow into two separate workflows: one “parent” workflow (fetching your Sheet with the individual URLs) and one “child” workflow (doing the heavy lifting of first downloading a file, then uploading it to your Google Cloud Storage).

You can then use the Split In Batches node in your parent workflow and split your data into small batches of maybe 5 URLs at a time. Your parent would then call the child workflow through the Execute Workflow node.

The advantage of this approach is that all resources required by the child workflow become available again after each child execution finishes, provided you only return a very small (or possibly empty) result to the parent. So instead of having to keep all 70 videos (at 6-8 MB each, roughly half a gigabyte) in memory at once, your n8n instance only needs to hold 5 videos (around 40 MB) at any given time.
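For the last step of the child workflow, a Code node along these lines could strip the binary data before handing control back to the parent (a minimal sketch; the `status` and `fileName` fields are just illustrative assumptions, not something n8n requires):

```js
// Final node of the child workflow ("Run Once for All Items" Code node):
// return a small JSON-only result so the parent never receives the
// downloaded video binaries back into its own memory.
return $input.all().map((item) => ({
  json: {
    status: 'uploaded',
    file: item.json.fileName, // hypothetical field set earlier in the child
  },
  // deliberately no `binary` property, so the video data is dropped here
}));
```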

Here’s how this could look:

Parent workflow

Child workflow

On a slightly related note, you probably want to set the N8N_DEFAULT_BINARY_DATA_MODE=filesystem environment variable to avoid using your memory for keeping large amounts of binary data.
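Since you’re running n8n via Docker, this could look something like the sketch below; the port and volume names are just the usual defaults, so adjust them to your existing setup:

```sh
# Persist binary data to disk instead of keeping it in memory
docker run -it --rm \
  --name n8n \
  -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  -e N8N_DEFAULT_BINARY_DATA_MODE=filesystem \
  docker.n8n.io/n8nio/n8n
```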


Thank you so much!! I will try this out in my workflow. Appreciate the help for a n00b


One more question, @MutedJam: I have now been using the batching functionality all over the place, so thank you for teaching me that!

I’d like to know if there is a best practice around how many items to put in a batch. Creating a batch of 1, running my script, writing to Google Sheets, and just looping has been working well since I can see my output immediately, but I’m not sure if there’s something I don’t know that makes this super inefficient. I’m not paying per API call on any of my services, but I’d love to know the pros and cons of larger versus smaller batches, and whether there are any dangers to be aware of.


Hi @buildingthings, the short answer is “it depends” (but you probably knew this already :wink:). Here are my thoughts on this:

Creating a batch of 1, running my script, writing to Google Sheets, and just looping has been working well since I can see my output immediately, but I’m not sure if there’s something I don’t know that makes this super inefficient.

So specifically with regard to Google Sheets, this is slightly less efficient than processing multiple items at once, but not by much. Using batches means you’ll call some nodes more than once, and each additional node execution comes at a small cost depending on your specific setup (typically a very small fraction of a second of computing time).

However, the Google Sheets API isn’t very performant anyway, so the usual waiting time when calling this API will far outweigh that overhead.

I’m not paying per API call on any of my services, but I’d love to know the pros and cons of larger versus smaller batches, and whether there are any dangers to be aware of.

Pros of smaller batches:

  • less data is processed with each individual sub-workflow execution, which increases overall stability
  • better visibility (you can see partial progress when executing your workflow manually)

Cons:

  • slightly slower (this might matter when you work with very fast databases on your local network, but probably not when using external services such as Google Sheets or Airtable)
  • more API calls are used (again, whether this matters will depend on the exact services you use)

Most often you probably want to pick a batch size larger than 1 but smaller than the total number of items to get the best of both worlds.
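To make that trade-off concrete, here is some back-of-the-envelope arithmetic using the numbers from earlier in this thread (70 videos of up to 8 MB each); the per-execution overhead figure is purely an assumption for illustration:

```js
// Rough batch-size trade-off; all numbers are assumptions from this thread
const totalItems = 70;   // videos to process
const itemSizeMB = 8;    // upper estimate per video
const overheadSec = 0.5; // assumed fixed cost per child workflow execution

for (const batchSize of [1, 5, 10, 70]) {
  const executions = Math.ceil(totalItems / batchSize);
  const peakMemoryMB = batchSize * itemSizeMB; // binary data held at once
  const overheadTotalSec = executions * overheadSec;
  console.log({ batchSize, executions, peakMemoryMB, overheadTotalSec });
}
// batchSize  1 → 70 executions,   8 MB peak, 35 s total overhead
// batchSize  5 → 14 executions,  40 MB peak,  7 s total overhead
// batchSize 70 →  1 execution,  560 MB peak (the original timeout scenario)
```

A batch size somewhere in the 5-10 range keeps both the execution count and the peak memory usage comfortably low here.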

Hope this makes sense!

Yes, thank you so much!! I think you’re right that with GSheets the advantages in reliability (and easier debugging: when a row fails, at least you know which one caused it) outweigh the potential slowdown, which doesn’t matter as much to me in this particular case. Thanks for helping me think this through!

