I’m facing a challenge with processing large PDF files (around 50 MB) stored in Google Drive. My goal is to extract text from these PDFs and store it in a vector database, but I have some constraints:
Limited server space – My self-hosted n8n instance does not have enough disk space to download large PDF files.
Data sensitivity – The PDFs contain sensitive information, so I cannot use third-party cloud services for processing.
Performance – Using the native Google Drive “Download” node in n8n is extremely slow for these large files, sometimes running for 10–15 minutes with no result.
No additional software installation – I cannot install Python, gdown, or any other tools on the server due to space limitations.
I am looking for a solution that:
Can process large PDF files directly from Google Drive, preferably in memory without writing them to disk.
Works entirely within n8n, using built-in nodes (HTTP Request, Function/Code nodes) or any standard approach compatible with n8n self-hosted.
Safely extracts text from the PDF for further use in a database or vector store.
Has anyone successfully implemented a workflow like this? Any best practices or example templates would be highly appreciated.
Hey @Innovazione_Clesia, hope all is good. The solution you are looking for could be Mistral OCR:
Basic OCR, which extracts text content from images or documents (PDFs, PPTX, DOCX), preserving the structure (headers, paragraphs, lists, tables) and returning markdown-formatted output along with image bounding boxes and metadata about the document layout.
Annotations, which allows you to extract information in a structured JSON format that you define. It offers two modes:
bbox_annotation: annotate individual bounding boxes (e.g., charts, figures) based on your provided format—useful for describing or captioning figures.
document_annotation: annotate and extract structured information from the entire document according to a predefined JSON schema.
The best part is that when calling their API you can specify the location of the document as a public link in the document_url field, for instance:
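A minimal sketch of such a call from an n8n Code node or HTTP Request node. The endpoint, model name, and field names are taken from Mistral's public docs at the time of writing, so treat them as assumptions and verify against the current API reference:

```javascript
// Sketch only: build the JSON body for a basic Mistral OCR request.
// Model id and field names are assumptions based on Mistral's docs.
function buildOcrRequest(documentUrl) {
  return {
    model: "mistral-ocr-latest",   // assumed current OCR model id
    document: {
      type: "document_url",
      document_url: documentUrl,   // any link Mistral's servers can fetch
    },
    include_image_base64: false,   // skip embedded images to keep the response small
  };
}

// POST this body to https://api.mistral.ai/v1/ocr with an
// "Authorization: Bearer <MISTRAL_API_KEY>" header.
const body = buildOcrRequest("https://example.com/report.pdf");
```

In n8n you could build this body in a Code node and feed it to an HTTP Request node so the API key stays in a credential rather than in code.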
Thank you so much for your solution. Your suggestion to use the Mistral OCR API to handle the problem is an excellent idea.
Description of My Problem
I am looking to solve an issue with an n8n workflow that handles a large PDF file (49 MB) from Google Drive. My files are highly sensitive, so security is my top priority.
My final goal is to extract the text from the PDF and send it to a vector database. To do this, I need to process the file using an external OCR service like Mistral.
The main problem is that the Google Drive node in my workflow is timing out after 30 minutes when trying to download the file. This prevents me from securely processing the file within my n8n server, which is essential because of the files’ sensitive nature. I have already tried to configure the workflow to download the file by its ID, but the download does not complete.
I am looking for a solution that allows me to:
Download the file from Google Drive without the node timing out.
Keep the data secure by avoiding making the file public via a shared URL.
I’ve been told that if I use the Mistral API and send the file as binary data, there is no risk to my files. Can you confirm if this is true and if anyone has experience with similar timeouts or handling large files in n8n?
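To sketch the binary path: instead of exposing the PDF via a public link, you can upload the raw bytes to Mistral's files endpoint and then OCR the short-lived signed URL it hands back, so only Mistral ever sees the document. The endpoint paths, the `purpose: "ocr"` value, and the `expiry` parameter are assumptions from Mistral's docs at the time of writing; verify them before use. Requires Node 18+ (global fetch/FormData/Blob), e.g. inside an n8n Code node:

```javascript
// Hedged sketch: OCR a private PDF by uploading the binary to Mistral
// instead of publishing a Drive link. Verify endpoints against the
// current Mistral API reference before relying on this.

const MISTRAL_API = "https://api.mistral.ai/v1";

// Pure helper so the OCR request body is easy to inspect and test.
function buildOcrBody(signedUrl) {
  return {
    model: "mistral-ocr-latest", // assumed current OCR model id
    document: { type: "document_url", document_url: signedUrl },
  };
}

async function ocrPrivatePdf(pdfBuffer, fileName, apiKey) {
  const auth = { Authorization: `Bearer ${apiKey}` };

  // 1. Upload the binary -- the file never needs to be public on Drive.
  const form = new FormData();
  form.append("purpose", "ocr");
  form.append("file", new Blob([pdfBuffer], { type: "application/pdf" }), fileName);
  const uploaded = await (await fetch(`${MISTRAL_API}/files`, {
    method: "POST", headers: auth, body: form,
  })).json();

  // 2. Ask for a temporary signed URL to the uploaded file (24 h expiry).
  const signed = await (await fetch(
    `${MISTRAL_API}/files/${uploaded.id}/url?expiry=24`, { headers: auth },
  )).json();

  // 3. OCR the signed URL.
  const res = await fetch(`${MISTRAL_API}/ocr`, {
    method: "POST",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify(buildOcrBody(signed.url)),
  });
  return res.json();
}
```

Note that this still streams the file through your n8n server's memory, so the Drive download timeout has to be solved separately; the upload itself just avoids a public share link.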
It shouldn’t take that long to download a 50 MB file from Google Drive. If it does, there is either a problem with your internet connection or something else is wrong with your workflow.
From looking at the workflow you’ve attached:
you search for files
you loop over found files
there is a problem here: only one file will ever be processed, because you do not close the loop from the Supabase node back to the Loop Over Items node. This needs to be fixed.
set ID
download the file
extract information from the file
set metadata
process the file / insert the embeddings
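For the last step above ("process the file / insert the embeddings"), the extracted text is usually split into overlapping chunks before embedding. A minimal Code-node sketch; the chunk size and overlap values are illustrative, not a recommendation:

```javascript
// Split extracted PDF text into overlapping windows for embedding.
// chunkSize/overlap are character counts and purely illustrative.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

// In an n8n Code node, return one item per chunk so the downstream
// embedding/Supabase nodes fan out over them, e.g.:
// return chunkText($json.text).map(c => ({ json: { pageContent: c } }));
```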
This workflow should work if your PDFs are text-based; if they do not include images, or you don’t care about the images, it should work without Mistral.
The biggest question is: if your restrictions are so tight and the documents are so sensitive…
why are they stored on Google Drive?
you are sending chunks of the docs to OpenAI to create vectors
you are sending chunks of the docs to Supabase for storage