Automated n8n workflow for analyzing, renaming, and AI-assisted processing of 10,000 OCR-scanned PDF documents

Hello dear n8n community,

I’m reaching out for support in building a comprehensive workflow using n8n to process around 10,000 locally stored PDF documents. All of these files have been scanned and OCR-processed using NAPS2. The goal is to automatically open each of these files and extract key information from the text, including: date, category (e.g., invoice, notice, court ruling), authority or institution, sender, recipient, reference number or case ID, subject or title, the name of the responsible person, and the number of pages.
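
To make the extraction step concrete, this is roughly the field structure I have in mind for each document (the names are only illustrative, not an existing schema):

```typescript
// Hypothetical target structure for the per-document extraction step.
// Field names are illustrative assumptions, not an existing schema.
interface DocumentFields {
  date: string;              // document date, e.g. "2024-03-15"
  category: string;          // e.g. "invoice", "notice", "court ruling"
  authority: string;         // authority or institution
  sender: string;
  recipient: string;
  referenceNumber: string;   // reference number or case ID
  subject: string;           // subject or title
  responsiblePerson: string;
  pageCount: number;
}
```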

The primary task is to automatically rename each PDF file using the extracted information, following this filename pattern:
[Date]_[Category]_[Authority or Sender]_[Subject or CaseID]_[PageCount].pdf
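
As a rough sketch of how the renaming itself could work, for example in a small script or an n8n Code node (paths and field names are placeholders, not a finished implementation):

```typescript
// Minimal sketch: build the target filename from extracted fields and rename the PDF.
// All paths and field names are placeholders.
import { promises as fs } from "fs";
import * as path from "path";

// Strip characters that are unsafe in filenames and collapse whitespace.
function sanitize(value: string): string {
  return value.trim().replace(/[\\\/:*?"<>|]+/g, "").replace(/\s+/g, "-");
}

function buildFilename(date: string, category: string, authorityOrSender: string,
                       subjectOrCaseId: string, pageCount: number): string {
  return [date, category, authorityOrSender, subjectOrCaseId, String(pageCount)]
    .map(sanitize)
    .join("_") + ".pdf";
}

async function renamePdf(originalPath: string, newName: string): Promise<string> {
  const target = path.join(path.dirname(originalPath), newName);
  await fs.rename(originalPath, target);
  return target;
}

// Example result: "2024-03-15_Invoice_TaxOffice_Case-123-45_3.pdf"
```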

This automated renaming is the main goal. However, to avoid redundant processing later, it would be ideal to capture all the extracted data in this first pass. Therefore, I’d also like to generate a structured Excel table listing all extracted data per file, along with a .txt file for each PDF that summarizes its contents. These text files would later be used for AI-based processing (e.g. content checking, reconstruction, or classification via GPT).
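
For the table, even a simple CSV that Excel can open directly would be enough for a first pass (a proper .xlsx could be produced by a spreadsheet node or library later). A sketch of writing the index row and the per-file .txt, with assumed output paths:

```typescript
// Minimal sketch: append one row per processed file to a CSV (which Excel opens
// directly) and write a per-file .txt summary for later AI processing.
// Output paths are assumptions.
import { promises as fs } from "fs";
import * as path from "path";

const INDEX_CSV = "/data/index.csv";
const SUMMARY_DIR = "/data/summaries";

async function recordResult(pdfPath: string, fields: Record<string, string>,
                            summaryText: string): Promise<void> {
  // Quote values so commas inside fields do not break the columns.
  const row = [pdfPath, ...Object.values(fields)]
    .map((v) => `"${v.replace(/"/g, '""')}"`)
    .join(",") + "\n";
  await fs.appendFile(INDEX_CSV, row, "utf8");

  await fs.mkdir(SUMMARY_DIR, { recursive: true });
  const txtName = path.basename(pdfPath, ".pdf") + ".txt";
  await fs.writeFile(path.join(SUMMARY_DIR, txtName), summaryText, "utf8");
}
```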

In addition, I plan to connect a second AI agent that, based on the extracted authority or institution, automatically searches and adds contact information: email address, phone number, fax number, website, street, postal code, the responsible director or department head, as well as the superior and subordinate authorities. These details will feed into a structured contact directory for future reference.
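
The contact directory entries could follow a structure along these lines (field names are illustrative only):

```typescript
// Hypothetical shape of one entry in the contact directory the second agent would fill in.
// Field names are illustrative only.
interface AuthorityContact {
  authority: string;                 // name of the authority or institution
  email?: string;
  phone?: string;
  fax?: string;
  website?: string;
  street?: string;
  postalCode?: string;
  directorOrDepartmentHead?: string;
  superiorAuthority?: string;
  subordinateAuthorities?: string[];
}
```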

If one or more required fields in a document cannot be identified or clearly assigned, the file should be automatically moved to a separate “Miscellaneous” folder, marked with an indication of which fields are missing. A follow-up workflow or AI bot should later be able to handle these cases and fill in the missing information.
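
A rough sketch of that fallback routing, with assumed field names and folder paths:

```typescript
// Minimal sketch: if required fields are missing, move the PDF to a "Miscellaneous"
// folder and write a sidecar note listing what is missing, so a follow-up workflow
// can fill the gaps later. Field names and paths are assumptions.
import { promises as fs } from "fs";
import * as path from "path";

const REQUIRED = ["date", "category", "authority", "subject"];
const MISC_DIR = "/data/Miscellaneous";

async function routeIfIncomplete(pdfPath: string,
                                 fields: Record<string, string | undefined>): Promise<boolean> {
  const missing = REQUIRED.filter((k) => !fields[k] || fields[k]!.trim() === "");
  if (missing.length === 0) return false;          // all fields present, keep normal flow

  await fs.mkdir(MISC_DIR, { recursive: true });
  const target = path.join(MISC_DIR, path.basename(pdfPath));
  await fs.rename(pdfPath, target);

  // Sidecar file marking which fields could not be identified.
  await fs.writeFile(target + ".missing.txt", missing.join("\n"), "utf8");
  return true;
}
```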

So my main questions to you are:
Does a similar workflow already exist? Can someone help me build it? And if not, would anyone be interested in creating this workflow together with me?

I deeply appreciate any tips, assistance, or collaborative offers. This task is part of a larger project in which n8n is intended to serve as the central automation platform, integrating structure, efficiency, and AI from the ground up.

Warm regards,

Hi, I am trying to achieve a similar workflow: I am developing a home document manager. Ultimately, what this workflow does is analyse incoming correspondence by reviewing a scanned PDF, then rename it accordingly and save it in the appropriate folder within Google Drive. The AI also determines whether any action is required and emails me to notify me. It can also determine whether the action is a task due by a certain date and provide a link in the email that lets me add the task to Tasker or my calendar. I have most of this working; however, my stumbling block is recognising the contents of the PDF. Since the scan is stored as an image inside a PDF wrapper, I get an error message saying that there is no binary data.
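
To narrow the problem down, I'm thinking a quick check in a Code node should show whether the items reaching the PDF step actually carry any binary data at all (this assumes n8n's $input API and that the file would normally sit in a binary property such as data):

```typescript
// Rough check for an n8n Code node ("Run Once for All Items"): report which incoming
// items carry binary data and under which property name. Assumes n8n's $input API.
const report = [];
for (const item of $input.all()) {
  const keys = Object.keys(item.binary ?? {});
  report.push({
    json: {
      file: item.json.fileName ?? "(unknown)",   // assumed field; adjust to your data
      hasBinary: keys.length > 0,
      binaryProperties: keys,                    // e.g. ["data"] when a file was read
    },
  });
}
return report;
```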

I would be happy to help you with your workflow, and perhaps you or somebody in this forum would be kind enough to help me with my issue.

Hi,

thank you very much for your kind response and for sharing your workflow with me – it sounds very well thought out and aligns closely with the vision I have for my own system. Your setup with automatic renaming, categorization, saving to Google Drive, and even detecting action items with calendar integration is exactly the kind of smart document processing I’m aiming for. Really impressive how far you’ve already come!

I’m currently working on implementing everything with n8n as the central automation platform. All of my documents have been scanned and OCR-processed using NAPS2. The trigger that fires when a new file is added to the folder already works on my side. However, the extraction of text content from the PDFs remains a major obstacle. Although tools like ChatGPT claim to support OCR-processed PDFs, in practice the AI often fails to extract the embedded text – especially when the PDF layout or encoding is more complex. Technically, it should work, but something still seems to be blocking the process.
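
One option I'm looking into is extracting the embedded text layer locally before involving the AI at all, for example with the pdf-parse npm package in a small script; if the returned text comes back empty, the PDF has no text layer and still needs OCR. A sketch of what I mean (not yet tested on my side):

```typescript
// Sketch: read the PDF from disk and pull out its embedded OCR text layer with the
// pdf-parse npm package, so the AI only ever sees plain text.
// If `text` comes back (almost) empty, the PDF is image-only and still needs OCR.
import { promises as fs } from "fs";
import pdf from "pdf-parse";

async function extractTextLayer(pdfPath: string): Promise<{ text: string; pages: number }> {
  const buffer = await fs.readFile(pdfPath);   // the raw PDF bytes
  const parsed = await pdf(buffer);            // parses the PDF and collects its text layer
  return { text: parsed.text, pages: parsed.numpages };
}
```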

Over the next few days, I’ll be double-checking my Docker folder mappings and file permissions to make sure n8n can access everything correctly. Meanwhile, I’m continuing to scan thousands of documents, and I hope to have a working solution by the end of the month – whether via n8n, OpenAI, or with help from someone in the community. I truly believe that this process is important, technically feasible, and should be relatively straightforward – even if, at the moment, it raises more questions than it answers.
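
For the Docker check, what I'll be verifying is essentially that the host folder with the scans is mapped into the n8n container and that n8n has permission to read and rename the files there; in docker-compose terms, something like this excerpt (paths are placeholders):

```yaml
# Excerpt from a docker-compose.yml: map the host folder with the scanned PDFs
# into the n8n container so the workflow can read and rename the files.
# Paths are placeholders.
services:
  n8n:
    image: n8nio/n8n
    volumes:
      - /home/user/scans:/data/scans   # host path : path inside the container
```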

If you’d be willing to share your workflow (or even just parts of it), I’d be very grateful. It might help us better understand the issues around text recognition and figure out what’s going wrong. Of course, I’ll also share any new insights I discover myself or receive from others right here in the forum.

Thanks again for your openness. I truly believe this kind of automation has huge potential to bring clarity, structure, and intelligent processing into otherwise chaotic document workflows – and that, with a bit more collaboration, we’re very close to achieving something practical and powerful.

Warm regards,
Chrystian