Try this and see if it works …
Overall Workflow:
- Get File: Fetch documents from Google Drive.
- Extract Text: Convert PDF/DOCX to plain text.
- Chunk Text: Break text into smaller pieces.
- Embed: Turn text chunks into numerical vectors.
- Store: Save vectors and original text/metadata in Qdrant.
Key Solutions for Text Extraction:
- PDFs:
  - Use an n8n Community PDF to Text Node (best).
  - Or, send PDF to an External PDF API via HTTP Request.
- Word (.docx) Files (Your Main Problem):
  - Recommended (Self-hosted n8n): Install and use a Community DOCX to Text Node (e.g., n8n-docx-converter). This is the cleanest.
  - Google Docs API (n8n Native):
    - Download DOCX from Drive.
    - Use Google Drive node to Copy File, setting Mime Type to application/vnd.google-apps.document (converts to Google Doc).
    - Download the new Google Doc using Google Drive node, setting Mime Type to text/plain (exports as text).
    - (Optional) Delete temporary Google Doc.
  - Python Script (Advanced, Self-hosted n8n):
    - Download DOCX binary.
    - Use Write File node to save it temporarily.
    - Use Execute Command to run a Python script (with docx2txt library) to convert it to text.
    - Use Read File to read the output text back in.
  - External Conversion API: Send DOCX to an external service (e.g., CloudConvert) via HTTP Request.
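For the Python Script route, here is a rough sketch of what the conversion script could look like. Note that it does not use docx2txt: it swaps in a standard-library approach (a .docx file is a ZIP archive whose body text lives in word/document.xml), so it should run without installing anything. The names docx_to_text and main, and the script filename in the comments, are my own placeholders.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used for elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (w:t) that belong to this paragraph
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def main(argv):
    """CLI entry point: argv[1] is the .docx path, text goes to stdout."""
    if len(argv) != 2:
        print("usage: docx_to_text.py FILE.docx", file=sys.stderr)
        return 2
    print(docx_to_text(argv[1]))
    return 0

# In the saved script (hypothetical name docx_to_text.py), finish with:
#   sys.exit(main(sys.argv))
# and call it from Execute Command, e.g.:
#   python3 docx_to_text.py /tmp/input.docx > /tmp/output.txt
```

This only reads the main document body. Headers, footers, footnotes, and tab or line-break elements are ignored, so if you need those, docx2txt or a community node is probably the better fit.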
After Text Extraction:
- Chunking: Use a Code Node (or a text splitter node if available) to break the text into manageable chunks (e.g., 500 characters with overlap).
- Embeddings: Use an Embedding Node (e.g., OpenAI, Cohere, Google Gemini) to convert each text chunk into a vector.
- Qdrant: Use the Qdrant Vector Store node to “Insert” these vectors, including the original text chunk and metadata (filename, page/section) as payload.
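To make the chunking and storage steps a bit more concrete, here is a small sketch in plain Python. It splits text into fixed-size character chunks with overlap, then pairs each chunk with its vector in the shape Qdrant expects for a points upsert (id, vector, payload). The 50-character overlap, the function names, and the payload keys (text, filename, chunk_index) are assumptions of mine, and the embedding call itself is left out: the vectors are simply passed in. Inside an n8n Code node the same logic would need adapting to the node's input/output format.

```python
import uuid


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def build_points(chunks, vectors, filename):
    """Pair each chunk with its vector as a Qdrant-style upsert body."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    points = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        points.append({
            "id": str(uuid.uuid4()),  # Qdrant accepts UUID strings or unsigned ints
            "vector": vector,
            # Payload keys are illustrative, not required names
            "payload": {"text": chunk, "filename": filename, "chunk_index": index},
        })
    return {"points": points}
```

Keeping the original chunk text in the payload means a search hit can be shown to the user (or passed to an LLM) without going back to the source file.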