Hello friends,
I need your help with a use case I’m working on in n8n, and I’m stuck.
Here’s my scenario:
When a new row is created in an Airtable table, it contains attached files of various types: .doc, .docx, .csv, .xls, .xlsx, .pdf, .pptx, .zip, etc.
I’ve set up a workflow in n8n that is supposed to:
- Download the files
- Separate zipped from non-zipped files (see the sketch after this list)
- Unzip the zipped ones
- Convert all files to plain text (the converted output can run to 300 pages)
- Send the extracted text to ChatGPT to extract structured data
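In case it clarifies what I mean by the split, here is a minimal sketch of how I picture it in a Code node followed by an IF node. I'm assuming the attachment binary sits under the key `data` and that a Compression node would handle the actual unzipping downstream:

```js
// n8n Code node (Run Once for All Items) — minimal sketch.
// Assumes each item carries its downloaded Airtable attachment as binary data
// under the key "data". It only tags each item with an isZip flag; an IF node
// after this one routes zips to a Compression (decompress) node and the rest
// straight on to text conversion.
const results = [];

for (const item of $input.all()) {
  const binary = item.binary?.data;
  const name = (binary?.fileName || item.json.filename || '').toLowerCase();
  const mime = binary?.mimeType || '';

  const isZip =
    name.endsWith('.zip') ||
    mime === 'application/zip' ||
    mime === 'application/x-zip-compressed';

  results.push({
    json: { ...item.json, isZip },
    binary: item.binary,
  });
}

return results;
```

The IF node would then simply check $json.isZip to send the two branches their own way.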
Here’s what I’ve done so far:
- Download the files
- Unzip if needed
- Merge all files into a single PDF using the pdf.co node
- Convert that merged PDF to text
- Split the text into chunks (see the sketch after this list)
- Send the chunks to ChatGPT
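For the chunking step, this is roughly what I have in mind in a Code node. It's only a sketch: the `text` and `fileName` fields and the chunk/overlap sizes are assumptions on my part, not necessarily what the PDF-to-text node outputs:

```js
// n8n Code node (Run Once for All Items) — sketch of the chunking step.
// Assumes the extracted text arrives on each item as item.json.text.
// Chunk size and overlap are illustrative values only.
const CHUNK_SIZE = 12000; // characters per chunk (very roughly ~3k tokens)
const OVERLAP = 500;      // characters repeated between chunks for context

const chunks = [];

for (const item of $input.all()) {
  const text = item.json.text || '';
  let start = 0;
  let index = 0;

  while (start < text.length) {
    const end = Math.min(start + CHUNK_SIZE, text.length);
    chunks.push({
      json: {
        source: item.json.fileName || 'unknown',
        chunkIndex: index,
        chunk: text.slice(start, end),
      },
    });
    if (end === text.length) break;
    start = end - OVERLAP;
    index += 1;
  }
}

return chunks;
```

Each chunk then goes to the ChatGPT step as its own item.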
But I’m running into major issues with pdf.co at the PDF-conversion, merge, and text-extraction steps.
The files are often large, so the process takes up to 30 minutes, burns through a lot of credits, and fails frequently.
So I need your help with:
- How would you extract text from multiple files more efficiently within n8n?
- Is pdf.co the right tool for this?
- Are there better alternative nodes or tools to use?
- How would you build a workflow like this that runs fast and reliably, without bugs?
Thanks a lot for your help!
And if you can share an example workflow JSON, that would be incredibly helpful for moving forward.