Can AI Agents Work With Files?

Hi, I’m trying to pass one or more files into an AI Agent cluster.

To test, I’ve tried uploading one file with the OpenAI node and then providing the File ID in the prompt, but it seems like the model cannot access the file.

(The response is always example data generated by the model, instead of parsed data from the uploaded file.)

I’m trying to build a feature in my app where users can click a button called “auto-underwrite statements with AI” on a deal. Once clicked, the deal’s bank statements will be sent to gpt-4-turbo and the financial information parsed from these statements will auto-populate on the deal.

Am I using the wrong node for this use case? From my understanding:

  • Basic LLM Chain can’t interact with files.
  • Question and Answer Chain can retrieve documents from vector storage, but is this meant for parsing many data points? The name “Question and Answer” implies that you can only ask one question at a time.
  • Summarization Chain can interact with binary data directly from the previous node in n8n, but is limited to summarization tasks.
  • AI Agent seems like the most flexible, but has no capabilities for interacting with files as far as I can tell.
  • I can create an Assistant in OpenAI with access to a previously uploaded file, but I get the following error: “The maximum number of files that can be attached to the assistant is 20”

Please point me in the right direction as I’ve already read all of n8n’s docs on AI.


It looks like your topic is missing some important information. Could you provide the following, if applicable?

  • n8n version:
  • Database (default: SQLite):
  • n8n EXECUTIONS_PROCESS setting (default: own, main):
  • Running n8n via (Docker, npm, n8n cloud, desktop app):
  • Operating system:

Sure:

  • n8n version: 1.37.3
  • Database (default: SQLite): default
  • n8n EXECUTIONS_PROCESS setting (default: own, main): default
  • Running n8n via (Docker, npm, n8n cloud, desktop app): n8n cloud
  • Operating system: macOS, Arc browser

Hi Fundmore,

Here’s an example for you to consider. In this flow, you take a PDF that contains some financial info and convert it to text.

Then you feed that text into a Basic LLM Chain node and ask GPT-4 to extract what you would like.

I recommend using GPT-4o instead of GPT-4 Turbo. The pricing is better and the quality is about the same.

The other thing to note: if you want OpenAI to extract the info in a certain format, the Basic LLM Chain node has an option to output in a specific JSON format that you define via a JSON schema.
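For illustration, a schema for bank-statement extraction might look something like this (the field names here are hypothetical examples — use whatever your deal records actually need):

```json
{
  "type": "object",
  "properties": {
    "account_holder": { "type": "string" },
    "statement_period": { "type": "string" },
    "total_deposits": { "type": "number" },
    "total_withdrawals": { "type": "number" },
    "ending_balance": { "type": "number" }
  },
  "required": ["account_holder", "ending_balance"]
}
```

The model will then return a JSON object matching these fields, which you can map straight onto the deal instead of parsing free-form text.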

Other considerations: Google's Gemini 1.5 Flash and Anthropic's Claude 3 Haiku are also pretty good at extraction tasks and have really large context windows, so it's worth seeing whether these models also work for your use case. The reason I bring this up is that they tend to cost a lot less than the GPT-4 family of models.

Hope this helps,
Derek


Thank you for the help Derek.

For anyone looking to use OpenAI for converting PDFs to text:

After a few days of trial and error, I’ve concluded that LLMs alone are insufficient for my use case (parsing bank statements).

Sometimes, bank statements will be a scan or photocopy and that requires image processing rather than just PDF text extraction.

Using GPT-4o for OCR is not ideal because it tends to hallucinate, producing false information where many online OCR tools get it right.

I haven’t tried open-source OCR engines like Tesseract because I know that image pre-processing (including Gaussian blur) is often required for best results, and that’s too much for me to get into.
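To make the pre-processing step above concrete: the two most common operations before feeding a scan to Tesseract are a Gaussian blur (to suppress scanner noise) and binarization (thresholding to pure black and white). This is a minimal pure-Python sketch of both, treating the image as a plain list-of-lists of grayscale values; in practice you'd use a library like OpenCV or Pillow, which implement the same ideas much faster:

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2-D Gaussian kernel (weights sum to 1)."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def blur(image, size=5, sigma=1.0):
    """Convolve a grayscale image (list of lists, values 0-255) with a
    Gaussian kernel. Edge pixels are handled by clamping coordinates."""
    k = gaussian_kernel(size, sigma)
    half = size // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += image[yy][xx] * k[dy + half][dx + half]
            out[y][x] = acc
    return out

def binarize(image, threshold=128):
    """Threshold to pure black/white, which OCR engines tend to prefer."""
    return [[255 if v >= threshold else 0 for v in row] for row in image]
```

Blurring spreads an isolated speck of noise across its neighbors so the threshold step can drop it, instead of Tesseract mistaking it for a stray character.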

I would love to try Nougat (seems like it combines PDF text extraction with image OCR), but I can’t get the Python client to work and I’m not familiar with the command line.

Anyway, I’m on n8n cloud, and Pyodide’s built-in packages don’t include it.

I think the only option would be outsourcing to an external service via API, ideally one that uses a combination of PDF text extraction, image OCR, and AI assist.


Hey Fundmore,

Just to comment on OCR solutions, I’d highly recommend Google Cloud’s Document AI offering. I’ve found the service to be fast, with consistent, solid results for any type of scan. The only caveat is that they may have different pricing for forms (not sure!), but otherwise it’s incredibly good value for money. I wrote a little about my experience here.

Edit: For an alternative approach, I also wrote about a similar task parsing invoices using LlamaParse. I found converting PDF tables to markdown tables allowed the LLM to understand structured data more easily.
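To show what that markdown conversion looks like, here's a small illustrative helper (this is not LlamaParse's API — just a hypothetical function for rendering already-extracted rows, assuming your parser gives you a header plus row tuples):

```python
def rows_to_markdown(header, rows):
    """Render extracted table rows as a GitHub-style markdown table,
    which gives the LLM explicit column boundaries to reason over."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

# Example: a couple of bank-statement lines (made-up data)
table = rows_to_markdown(
    ["Date", "Description", "Amount"],
    [["2024-01-02", "POS Purchase", "-45.00"],
     ["2024-01-05", "Payroll Deposit", "2500.00"]],
)
```

The pipe-delimited layout preserves which value belongs to which column, whereas raw PDF text extraction often flattens a table into an ambiguous stream of numbers.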


Hi Jim, thank you for your insightful reply. I will be trying both solutions and I’ll share my results here.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.