Processing CSV files more than 100k records in Batches

Takshak · March 26, 2024, 5:34pm

I am trying to download a CSV file and then load into a MongoDB collection. The file is large and can easily contain 100k-200k records, file size approx 20-30 MB. I tried processing direct but it is giving me memory warning, possible reason seems to be having more than 100k json records. Thats ok and understandable. Now I am trying to process it in batches using Extract from File node, batch size 1000. But now i am confused as in How to iterate to next 1000 records. I tried using simple loop with Starting Line option in Extract From File to dynamically read the next 1000 record based on runIndex but that node expects another binary file instead.

Can any one please guide how to loop over and successfully process 1000 records until there is no more records in CSV?

Share the output returned by the last node

Information on your n8n setup

**n8n version:**1:32.1
Database (default: SQLite): MongoDB
n8n EXECUTIONS_PROCESS setting (default: own, main):
**Running n8n via (Docker, npm, n8n cloud, desktop app):**n8n Cloud
Operating system:

Ludwig · March 28, 2024, 9:47pm

Hi @Takshak! Thanks for reaching out! You are probably looking for the Loop over Items node.

There are a number of great examples of that node in use in our forum here, if you search for the node. You may also look into decreasing the batch size or splitting the CSV file into many, storing it in a cloud storage tool like S3, then iterating through all of the csv files and breaking down the batches even further in size.

But that’s just a recommendation. Happy building!

Takshak · March 28, 2024, 11:23pm

Thank you Ludwig!

Perhaps i was not very clear on my requirement or may be i am missing something. If I understand correctly, Loop Over Items will split a large number of items into batches and then loop until last item. But in my case, it will only be applicable after all the records are extracted from the CSV file and feeded into Loop over Items node. Thats where the problem is, i am unable to extract the full records from the csv file, i am getting out of memory or something sismilar error there, when i checked the documentations, it said i cant have very large number of JSON records. So i am extracting the records in the multiple of 100k records but i am facing hard time looping back to extract node to extract next 100k records.

See the screenshots below:

Max number of rows - 100k

Max number of rows > 100k

In this case if I cant get the full records out how can I pass it to Loop Over Items node?

When i am loopong back, it says it is expecting object type as input. Please confirm if my understanding is correct and advice on the possible solution. Also confirm of Loop Over Item can extract records from csv directly and loop?

Regards,
Vishnu

Ludwig · March 29, 2024, 3:00pm

Hi Vishnu, yeah due to the memory limitations you are limited in terms of your options. Your best one (and fastest) right now is to upgrade to a higher tier cloud plan or to self-host to manage memory.

Takshak · March 29, 2024, 4:31pm

What are the numbers in terms of memory limitation in each cloud tier? In self host also what are the recommended memory / cpu for processing upto 200k json records? I have tried it on 1 gb ram and 1 cpu server just for this simple workflow, no other load, it hanged entire server, seems inefficient. Before running this workflow cpu and memory utilization were under 20%. If processing 21 mb file consumes whole 750 mb if memory, i have serious doubts on the performance and efficiency of N8N.

Is there any way to restrict such workflows from utilising all the available memory or cpu in self host?

Ludwig · March 30, 2024, 3:17am

One thing I wanted to highlight was I said your fastest (not only) solution was to upgrade or self-host. I was able to run your workflow with just a minor tweak on a Pro tier cloud plan. That being said, there are alternative paths you could go down for preprocessing your file in such a way that you wouldn’t ever encounter a memory warning even on our lower tier paid plans, if that’s your desired approach.

You can find a number of these mentioned by community members here if you search for CSV-related issues. Example:

Regarding your memory-related questions, I encourage you to reach through these articles.

I hope this helps!

Takshak · March 30, 2024, 7:59am

Thank you Ludwig for your prompt response!

Can you please share what exact minor tweak you did to make my workflow work even in Pro plan? Meanwhile I will go through the the shared links, thank you!

Regards,
Vishnu

Ludwig · April 3, 2024, 4:03am

Hi @Takshak, the only change was to remove the final connection from the HTTP Request to the Extract from File node and disable the HTTP request node.

Takshak · April 3, 2024, 8:52am

I dont get it, Why would I disconnect and disable the HTTP Node if my goal is to make that work and extract the data from it successfully?

Ludwig · April 3, 2024, 6:06pm

I meant like this:

You can see that the return connection where you previously had your HTTP Response node connect back to your Extract from File node was removed.

Takshak · April 3, 2024, 6:39pm

Thanks, but that was anyways not required and i was unable to reach there in the loop on first place. It was failling everytime on the extract node. Umable to extract 150k or more JSON records. Loop was to process it in batches of 1k or 10k.

How did it worked on yours while not on mine? I am on Pro plan i believe? Did you mean higher pro plans?

lightcom · April 4, 2024, 8:43am

Try to find API calls

https://api.dhan.co/#/

Takshak · April 4, 2024, 6:33pm

Checked already, there is none.

Jon · April 5, 2024, 3:19pm

Hey @Takshak,

I have taken a look and at the moment we can’t loop CSV files like that so it would be a case of making smaller files with chunks of data which should work if that fails I would look again at the API to see if there is a better way to handle it until we can support looping csv files.

lightcom · April 8, 2024, 11:29am

I’ve looked at part of the document and see that you may not need to log this information very often, and if it was a third party viewer for MongoDB with a one time manual csv upload it would make it easier.
Directus, Supabase, phpMyAdmin have such functionality. So you can do it manually.
Make two bases the first active, the second for synchronisation. And compare them with each other, adding or discarding unnecessary records.
Or find a service that receives exactly the same parameters you need and filtering by API

system · July 7, 2024, 11:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.