Efficiently Extract Text from Large PDF Files Stored in Google Drive Without Downloading Locally in n8n

Hello n8n Community,

I’m facing a challenge with processing large PDF files (around 50 MB) stored in Google Drive. My goal is to extract text from these PDFs and store it in a vector database, but I have some constraints:

  1. Limited server space – My self-hosted n8n instance does not have enough disk space to download large PDF files.

  2. Data sensitivity – The PDFs contain sensitive information, so I cannot use third-party cloud services for processing.

  3. Performance – Using the native Google Drive “Download” node in n8n is extremely slow for these large files, sometimes running for 10–15 minutes with no result.

  4. No additional software installation – I cannot install Python, gdown, or any other tools on the server due to space limitations.

I am looking for a solution that:

  • Can process large PDF files directly from Google Drive, preferably in memory without writing them to disk.

  • Works entirely within n8n, using built-in nodes (HTTP Request, Function/Code nodes) or any standard approach compatible with n8n self-hosted.

  • Safely extracts text from the PDF for further use in a database or vector store.
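For the "directly from Google Drive" part, one approach worth testing is to skip the Google Drive node and call the Drive v3 API from an HTTP Request node instead: `files.get` with `alt=media` returns the raw bytes of the file. A minimal sketch as curl (the file ID and `ACCESS_TOKEN` are placeholders; in n8n this maps to an HTTP Request node with the response format set to File so the bytes flow between nodes as binary data):

```shell
# Build the Drive v3 direct-download URL; alt=media returns the raw file bytes
drive_media_url() {
  printf 'https://www.googleapis.com/drive/v3/files/%s?alt=media' "$1"
}

# Placeholder usage -- ACCESS_TOKEN must be a valid OAuth2 token with Drive
# read scope, FILE_ID the Drive file id:
# curl -sS -H "Authorization: Bearer ${ACCESS_TOKEN}" \
#   "$(drive_media_url FILE_ID)" -o doc.pdf
```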

Has anyone successfully implemented a workflow like this? Any best practices or example templates would be highly appreciated.

Thank you in advance for your help!


Hey @Innovazione_Clesia, hope all is good. The solution you are looking for could be Mistral OCR:

  • Basic OCR, which extracts text content from images or documents (PDFs, PPTX, DOCX), preserving structure (headers, paragraphs, lists, tables) and returning markdown-formatted output along with image bounding boxes and metadata about the document layout.
  • Annotations, which allows you to extract information in a structured JSON format that you define. It offers two modes:
    • bbox_annotation: annotate individual bounding boxes (e.g., charts, figures) based on your provided format—useful for describing or captioning figures.
    • document_annotation: annotate and extract structured information from the entire document according to a predefined JSON schema.

The best part is that when calling their API you can specify the location of the document as a public link in the document_url field, for instance:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "include_image_base64": true
  }' -o ocr_output.json

or

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "document_type": {"title": "Document_Type", "type": "string"},
                    "short_description": {"title": "Short_Description", "type": "string"},
                    "summary": {"title": "Summary", "type": "string"}
                },
                "required": ["document_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "document_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

Hello!

Thank you so much for your solution. Your suggestion to use the Mistral OCR API to handle the problem is an excellent idea.


Description of My Problem

I am looking to solve an issue with an n8n workflow that handles a large PDF file (49 MB) from Google Drive. My files are highly sensitive, so security is my top priority.

My final goal is to extract the text from the PDF and send it to a vector database. To do this, I need to process the file using an external OCR service like Mistral.

The main problem is that the Google Drive node in my workflow is timing out after 30 minutes when trying to download the file. This prevents me from securely processing the file within my n8n server, which is essential because of the files’ sensitive nature. I have already tried to configure the workflow to download the file by its ID, but the download does not complete.

I am looking for a solution that allows me to:

  1. Download the file from Google Drive without the node timing out.
  2. Keep the data secure, by avoiding making the file public via a shared URL.

I’ve been told that if I use the Mistral API and send the file as binary data, there is no risk to my files. Can you confirm if this is true and if anyone has experience with similar timeouts or handling large files in n8n?
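On the binary-data question: Mistral's API does let you avoid a public link. As far as I can tell from their docs, you upload the PDF itself to the files endpoint with `purpose=ocr`, then exchange the returned file id for a short-lived signed URL and pass that as `document_url` in the OCR call shown earlier in this thread. A hedged sketch (the file path, id, and expiry are placeholders; double-check the current Mistral docs for the exact fields):

```shell
# Step 1: upload the PDF as binary multipart data -- no public sharing needed.
# MISTRAL_API_KEY is assumed to be set; doc.pdf is a placeholder path.
# curl https://api.mistral.ai/v1/files \
#   -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
#   -F purpose="ocr" \
#   -F file=@doc.pdf
# The response JSON contains an "id" for the uploaded file.

# Step 2: exchange the file id for a signed URL (expiry in hours), then use
# the returned "url" as document_url in the /v1/ocr request.
signed_url_endpoint() {
  printf 'https://api.mistral.ai/v1/files/%s/url?expiry=%s' "$1" "$2"
}
# curl "$(signed_url_endpoint FILE_ID 24)" \
#   -H "Authorization: Bearer ${MISTRAL_API_KEY}"
```

Since the signed URL is private and short-lived, the file never has to be made public on Drive or anywhere else.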

Hey @Innovazione_Clesia hope all is well.

It shouldn’t take so much time to download a 50MB file from GD. If it does, there is either a problem with your internet connection or something else is wrong with your workflow.
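As a sanity check outside n8n, you can time a resumable download of the same file against the Drive API; curl's `-C -` resumes a partial transfer and `--retry` recovers from transient drops instead of restarting the whole 50 MB (the file id and token are placeholders):

```shell
FILE_ID="<your-file-id>"   # placeholder Drive file id
URL="https://www.googleapis.com/drive/v3/files/${FILE_ID}?alt=media"

# -C -    : continue a partial download where it left off
# --retry : retry transient network failures automatically
# ACCESS_TOKEN must be a valid OAuth2 token with Drive read scope.
# curl -sS --fail -C - --retry 5 --retry-delay 10 \
#   -H "Authorization: Bearer ${ACCESS_TOKEN}" -o doc.pdf "${URL}"
echo "${URL}"
```

If this finishes quickly, the bottleneck is in the workflow, not the connection.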

From looking at the workflow you’ve attached:

  • you search for files
  • you loop over found files
    • there is a problem where you will only process one file, because you do not close the loop from the Supabase node back to the Loop Over Items node. This needs to be fixed.
  • set ID
  • download the file
  • extract information from the file
  • set metadata
  • process the file / insert the embeddings

This workflow should work without Mistral if your PDFs are text-based, i.e. they contain no images or you don’t care about the images in them.

The biggest question is: if your restrictions are so tight and the documents are so sensitive…

  • why are they located on Google Drive? :slight_smile:
  • you are sending chunks of the docs to OpenAI to create vectors
  • you are sending chunks of the docs to Supabase to store

Thanks a lot for the help! :folded_hands:

The issue with the loop was indeed the problem — I’ve fixed the workflow and now the loop is working correctly.

Just to clarify:
– I’m using n8n
– Supabase is self-hosted
– My documents are stored on Google Drive

The problem still exists in my setup, but now at least the loop is properly configured. Thanks again for pointing me in the right direction!

You are welcome! If this helped you to overcome that specific issue, please consider marking my answer as the Solution, thank you.

If there is another problem in the setup, feel free to open a separate topic about it and tag me there.

Cheers!
