Efficiently Extract Text from Large PDF Files Stored in Google Drive Without Downloading Locally in n8n

Hello n8n Community,

I’m facing a challenge with processing large PDF files (around 50 MB) stored in Google Drive. My goal is to extract text from these PDFs and store it in a vector database, but I have some constraints:

  1. Limited server space – My self-hosted n8n instance does not have enough disk space to download large PDF files.

  2. Data sensitivity – The PDFs contain sensitive information, so I cannot use third-party cloud services for processing.

  3. Performance – Using the native Google Drive “Download” node in n8n is extremely slow for these large files, sometimes running for 10–15 minutes with no result.

  4. No additional software installation – I cannot install Python, gdown, or any other tools on the server due to space limitations.

I am looking for a solution that:

  • Can process large PDF files directly from Google Drive, preferably in memory without writing them to disk.

  • Works entirely within n8n, using built-in nodes (HTTP Request, Function/Code nodes) or any standard approach compatible with n8n self-hosted.

  • Safely extracts text from the PDF for further use in a database or vector store.
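For the "directly from Google Drive" part, one approach worth testing is to skip the Google Drive node and call the Drive v3 API from an HTTP Request node instead: `files.get` with `alt=media` returns the raw bytes of the file. A minimal sketch as curl (the file ID and `ACCESS_TOKEN` are placeholders; in n8n this maps to an HTTP Request node with the response format set to File so the bytes flow between nodes as binary data):

```shell
# Build the Drive v3 direct-download URL; alt=media returns the raw file bytes
drive_media_url() {
  printf 'https://www.googleapis.com/drive/v3/files/%s?alt=media' "$1"
}

# Placeholder usage -- ACCESS_TOKEN must be a valid OAuth2 token with Drive
# read scope, FILE_ID the Drive file id:
# curl -sS -H "Authorization: Bearer ${ACCESS_TOKEN}" \
#   "$(drive_media_url FILE_ID)" -o doc.pdf
```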

Has anyone successfully implemented a workflow like this? Any best practices or example templates would be highly appreciated.

Thank you in advance for your help!


Hey @Innovazione_Clesia, hope all is good. The solution you are looking for could be Mistral OCR:

  • Basic OCR, which extracts text content from images or documents (PDFs, PPTX, DOCX), preserving structure (headers, paragraphs, lists, tables) and returning markdown-formatted output along with image bounding boxes and metadata about the document layout.
  • Annotations, which allows you to extract information in a structured JSON format that you define. It offers two modes:
    • bbox_annotation: annotate individual bounding boxes (e.g., charts, figures) based on your provided format—useful for describing or captioning figures.
    • document_annotation: annotate and extract structured information from the entire document according to a predefined JSON schema.

The best part is that when calling their API you can specify the location of the document as a public link in the document_url field, for instance:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "include_image_base64": true
  }' -o ocr_output.json

or

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "document_type": {"title": "Document_Type", "type": "string"},
                    "short_description": {"title": "Short_Description", "type": "string"},
                    "summary": {"title": "Summary", "type": "string"}
                },
                "required": ["document_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "document_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

Hello!

Thank you so much for your solution. Your suggestion to use the Mistral OCR API to handle the problem is an excellent idea.


Description of My Problem

I am looking to solve an issue with an n8n workflow that handles a large PDF file (49 MB) from Google Drive. My files are highly sensitive, so security is my top priority.

My final goal is to extract the text from the PDF and send it to a vector database. To do this, I need to process the file using an external OCR service like Mistral.

The main problem is that the Google Drive node in my workflow is timing out after 30 minutes when trying to download the file. This prevents me from securely processing the file within my n8n server, which is essential because of the files’ sensitive nature. I have already tried to configure the workflow to download the file by its ID, but the download does not complete.

I am looking for a solution that allows me to:

  1. Download the file from Google Drive without the node timing out.
  2. Keep the data secure, by avoiding making the file public via a shared URL.

I’ve been told that if I use the Mistral API and send the file as binary data, there is no risk to my files. Can you confirm if this is true and if anyone has experience with similar timeouts or handling large files in n8n?
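On the binary-data question: Mistral's API does let you avoid a public link. As far as I can tell from their docs, you upload the PDF itself to the files endpoint with `purpose=ocr`, then exchange the returned file id for a short-lived signed URL and pass that as `document_url` in the OCR call shown earlier in this thread. A hedged sketch (the file path, id, and expiry are placeholders; double-check the current Mistral docs for the exact fields):

```shell
# Step 1: upload the PDF as binary multipart data -- no public sharing needed.
# MISTRAL_API_KEY is assumed to be set; doc.pdf is a placeholder path.
# curl https://api.mistral.ai/v1/files \
#   -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
#   -F purpose="ocr" \
#   -F file=@doc.pdf
# The response JSON contains an "id" for the uploaded file.

# Step 2: exchange the file id for a signed URL (expiry in hours), then use
# the returned "url" as document_url in the /v1/ocr request.
signed_url_endpoint() {
  printf 'https://api.mistral.ai/v1/files/%s/url?expiry=%s' "$1" "$2"
}
# curl "$(signed_url_endpoint FILE_ID 24)" \
#   -H "Authorization: Bearer ${MISTRAL_API_KEY}"
```

Since the signed URL is private and short-lived, the file never has to be made public on Drive or anywhere else.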

Hey @Innovazione_Clesia hope all is well.

It shouldn’t take so much time to download a 50MB file from GD. If it does, there is either a problem with your internet connection or something else is wrong with your workflow.
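As a sanity check outside n8n, you can time a resumable download of the same file against the Drive API; curl's `-C -` resumes a partial transfer and `--retry` recovers from transient drops instead of restarting the whole 50 MB (the file id and token are placeholders):

```shell
FILE_ID="<your-file-id>"   # placeholder Drive file id
URL="https://www.googleapis.com/drive/v3/files/${FILE_ID}?alt=media"

# -C -    : continue a partial download where it left off
# --retry : retry transient network failures automatically
# ACCESS_TOKEN must be a valid OAuth2 token with Drive read scope.
# curl -sS --fail -C - --retry 5 --retry-delay 10 \
#   -H "Authorization: Bearer ${ACCESS_TOKEN}" -o doc.pdf "${URL}"
echo "${URL}"
```

If this finishes quickly, the bottleneck is in the workflow, not the connection.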

From looking at the workflow you’ve attached:

  • you search for files
  • you loop over found files
    • there is a problem where you will only process one file, because you do not close the loop from the Supabase node back to the Loop Over Items node. This needs to be fixed.
  • set ID
  • download the file
  • extract information from the file
  • set metadata
  • process the file / insert the embeddings

This workflow should work without Mistral if your PDFs are text-based, i.e. they contain no images or you don’t care about the images in them.

The biggest question is: if your restrictions are so tight and the documents are so sensitive…

  • why are they located on Google Drive? :slight_smile:
  • you are sending chunks of the docs to OpenAI to create vectors
  • you are sending chunks of the docs to Supabase to store

Thanks a lot for the help! :folded_hands:

The issue with the loop was indeed the problem — I’ve fixed the workflow and now the loop is working correctly.

Just to clarify:
– I’m using n8n
– Supabase is self-hosted
– My documents are stored on Google Drive

The problem still exists in my setup, but now at least the loop is properly configured. Thanks again for pointing me in the right direction!

You are welcome! If this helped you to overcome that specific issue, please consider marking my answer as the Solution, thank you.

If there is another problem in the setup, feel free to open a separate topic about it and tag me there.

Cheers!
