n8n – Help Needed: Efficient File-to-Text Extraction Workflow (Multi-format, Large Files)

Hello friends,

I need your help with a use case I’m working on in n8n, and I’m stuck.

Here’s my scenario:
When a new row is created in an Airtable table, it contains attached files of various types: .doc, .docx, .csv, .xls, .xlsx, .pdf, .pptx, .zip, etc.

I’ve set up a workflow in n8n that is supposed to:

  1. Download the files
  2. Separate zipped from non-zipped files
  3. Unzip the zipped ones
  4. Convert all files to plain text (a converted file can run up to 300 pages)
  5. Send the extracted text to ChatGPT to extract structured data
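
For steps 2 and 3, here is a minimal Python sketch of separating zipped from non-zipped files and expanding the archives in memory. It is only an illustration under assumptions: `flatten_attachments` is a hypothetical helper name, the attachments are assumed to be already downloaded as `(filename, bytes)` pairs, and it would run outside n8n or in an environment where Python is available.

```python
import io
import zipfile
from pathlib import PurePosixPath


def flatten_attachments(files):
    """Expand zip archives so the result holds only regular files.

    `files` is a list of (filename, bytes) pairs (an assumed input shape).
    Archives nested inside archives are expanded recursively.
    """
    flat = []
    for name, data in files:
        is_zip = name.lower().endswith(".zip") and zipfile.is_zipfile(io.BytesIO(data))
        if is_zip:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                # Keep only file entries and drop the folder part of each name.
                members = [
                    (PurePosixPath(info.filename).name, archive.read(info))
                    for info in archive.infolist()
                    if not info.is_dir()
                ]
            flat.extend(flatten_attachments(members))
        else:
            flat.append((name, data))
    return flat
```

Reading each member fully into memory keeps the sketch short; for very large archives, writing the members to disk one at a time would probably be safer.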

Here’s what I’ve done so far:

  1. Download the files
  2. Unzip if needed
  3. Merge all files into a single PDF using the pdf.co node
  4. Convert that merged PDF to text
  5. Split the text into chunks
  6. Send the chunks to ChatGPT
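
To make step 5 concrete, this is roughly what the chunking could look like, as a sketch rather than what the workflow currently does. The name `split_into_chunks` and the default sizes are assumptions; the right chunk size depends on the model's context limit.

```python
def split_into_chunks(text, max_chars=8000, overlap=200):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two neighbours.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            space = text.rfind(" ", start + overlap + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` characters for the next chunk.
        start = end - overlap
    return chunks
```

Splitting by characters is a simplification; counting tokens would track the model limit more closely.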

But I’m running into major issues with pdf.co when converting to PDF, merging, and extracting the text.
The files are often large, the process takes up to 30 minutes, consumes a lot of credits, and fails frequently.

So I need your help with:

  1. How would you extract text from multiple files more efficiently within n8n?
  2. Is pdf.co the right tool for this?
  3. Are there better alternative nodes or tools to use?
  4. How would you build a workflow like this that runs fast and reliably, without bugs?

Thanks a lot for your help!
And if you can share an example JSON workflow, that would be incredibly helpful for me to move forward.

Hi, are the PDFs scanned, or were they created digitally?
If they were created digitally, you don’t need OCR or a service like pdf.co; you can extract the text on the command line.
You will, however, need to modify the Dockerfile to install some additional PDF tools (for example poppler-utils, which provides pdftotext); then you can convert all the PDFs with pdftotext.
This does require a self-hosted n8n. I can provide some instructions on how to set everything up.
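
To give an idea of the command-line route, here is a minimal Python sketch. It assumes the tool in question is Poppler's `pdftotext` (installed via poppler-utils) and that it is on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` and the timeout value are made up for illustration. In n8n the same command could be run from an Execute Command node instead.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout (useful for tables).
        cmd.append("-layout")
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, timeout_s=300):
    """Extract text from a digitally created PDF (no OCR is performed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Scanned pages would come back mostly empty with this approach, so those files would still need an OCR step.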

Thank you @crisl for your response and your proposed solutions.
Indeed, the documents include PDFs as well as other formats such as CSV, DOC, DOCX, and XLS, and sometimes several documents zipped together.
Some of these documents, whether PDFs or not, may contain images or scanned sections.
So, if we can include a service that performs OCR, that would be great.

Yes, I’m interested. If you could share more details about the instructions, that would help.
Also, if we could have a quick video call this morning, it would be easier for me to show you what I’ve done so far and better understand your instructions on how to implement your suggestion.