Parse Large CSV

Hi,

I need some assistance with parsing a very large CSV with a lot of rows and columns.

Information on the CSV File

I have a CSV file that gets updated daily with numerous lines. The CSV itself contains around 80,000 rows and roughly 91 columns. I know, it's a massive file and not one that is very nice to work with.

What I would like to achieve is the following:

  1. Remove the headers and their respective column data that I don't need in order to make the file "workable", reducing it from 91 headers to approx. 7

Links to the files:

  1. Excel File Headers: The headers in yellow are what needs to be removed: Remove Headers & Column Data

  2. CSV Header File: CSV Header File

  3. Dummy CSV File with ALL the data: Dummy CSV File

I am not even sure if this is possible with n8n, but I have tried everything I can think of, like Function nodes and Split In Batches, with no luck. As for the workflow at the moment, there is nothing special about it.

Information on your n8n setup

n8n Version: 0.195.5
Running n8n via Docker on an Ubuntu VPS

Thank you for your help.

Hey @wrwatk,

Have you tried using a Set node to extract the data you are after, then setting it to "Keep Only Set"?
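
If the Set node gets fiddly with that many columns, roughly the same pruning can be done in a Function node. A minimal sketch, assuming the CSV rows arrive as individual items and using placeholder column names (swap in your 7 real headers):

```javascript
// Keep only the columns we care about and drop the other ~80.
// These names are placeholders – replace them with the real headers.
const wanted = ['OrderId', 'Date', 'Sku', 'Qty', 'Price', 'Customer', 'Status'];

return items.map((item) => {
  const trimmed = {};
  for (const key of wanted) {
    // Copy a field only if it exists on the incoming row
    if (key in item.json) {
      trimmed[key] = item.json[key];
    }
  }
  return { json: trimmed };
});
```

A Function node runs once over all items, so it avoids per-item overhead, but bear in mind the whole file still has to be read into memory first.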

@Jon I had thought about doing that, but correct me if I am wrong: to extract the information I need, I would first need to read the file content, and that is where the issue starts. There is simply too much data for Docker to read through. My rough maths: 81 columns by 80,000 rows works out to roughly 6.5 million fields of text. If you know of another way, I am all ears.

Hey @wrwatk,

You are not wrong there, but no matter what you do the data will still have to be read. When you try to open the file, does it show any errors? I would imagine the UI will be very slow to display the data, but when running in the background it should be OK, assuming the resources are set correctly for what you are doing.
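
On the resources side, if the container is running out of memory you can raise the Node.js heap for n8n via the `NODE_OPTIONS` environment variable on the container, e.g. `NODE_OPTIONS=--max-old-space-size=4096` (the 4096 is just an example figure; it assumes the VPS actually has that much RAM to spare).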

Outside of this though… how did the file get so large, and have you thought about using a database or something like Baserow instead?

@Jon

Thanks for the suggestion, and you are correct about the database; getting all this information into a database is exactly what I'm trying to do, and it is proving difficult. The catch is that it needs to happen daily, as the file is sent from an external source and contains information I need to parse every day.

Hey @wrwatk,

I guess what I would do is build the workflow using just the header document, then schedule it to run against the full file and see if it works as expected.
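
As a rough outline (not a tested workflow, and node names from memory), something like:

  1. Cron node scheduled to run daily
  2. Read Binary File node pointing at the CSV on disk
  3. Spreadsheet File node (Read From File) to turn the CSV rows into items
  4. Set node with "Keep Only Set" listing the ~7 columns you want
  5. A database node (Postgres, MySQL, etc.) to insert the trimmed rows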

Hi @Jon

I like your thinking, and that could potentially work. Would you be able/willing to provide me with a starting workflow I could build on?