Word files to Qdrant library

TjM · July 7, 2025, 5:57pm

Try this and see if it works …

Overall Workflow:

Key Solutions for Text Extraction:

PDFs:
- Use an n8n Community PDF to Text Node (best).
- Or, send PDF to an External PDF API via HTTP Request.
Word (.docx) Files (Your Main Problem):
1. Recommended (Self-hosted n8n): Install and use a Community DOCX to Text Node (e.g., n8n-docx-converter). This is the cleanest.
2. Google Docs API (n8n Native):
- Download DOCX from Drive.
- Use Google Drive node to Copy File, setting Mime Type to application/vnd.google-apps.document (converts to Google Doc).
- Download the new Google Doc using Google Drive node, setting Mime Type to text/plain (exports as text).
- (Optional) Delete temporary Google Doc.
3. Python Script (Advanced, Self-hosted n8n):
- Download DOCX binary.
- Use Write File node to save it temporarily.
- Use Execute Command to run a Python script (with docx2txt library) to convert it to text.
- Use Read File to load the output text.
4. External Conversion API: Send DOCX to an external service (e.g., CloudConvert) via HTTP Request.
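
For the Python script option, here is a rough sketch of what the script run by Execute Command could look like. Note that it does not use docx2txt: it swaps in a standard-library-only approach (a .docx is a zip archive with the body text in word/document.xml) so it runs without installing anything. The function name `docx_to_text` and the command-line arguments are my own placeholders, and this simple version ignores headers, footers, footnotes and tabs.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (w:t) found inside this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


# Hypothetical usage: python docx_to_text.py input.docx output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2], "w", encoding="utf-8") as out:
        out.write(docx_to_text(sys.argv[1]))
```

If docx2txt is installed, its `docx2txt.process(path)` call should be able to replace the function body; the stdlib version is only there to keep the sketch dependency-free.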

After Text Extraction:

Chunking: Use a Code Node (or a text splitter node if available) to break the text into manageable chunks (e.g., 500 characters with overlap).
Embeddings: Use an Embedding Node (e.g., OpenAI, Cohere, Google Gemini) to convert each text chunk into a vector.
Qdrant: Use the Qdrant Vector Store node to “Insert” these vectors, including the original text chunk and metadata (filename, page/section) as payload.
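
To make the chunking and insert steps above more concrete, here is a minimal sketch in plain Python. Inside n8n the Qdrant Vector Store node does the insert for you, so this is only meant to show the shape of the data. Assumptions on my side: the 50-character overlap, the function names `chunk_text` and `build_points`, and the payload keys `text`, `filename` and `chunk_index` (your node may use different key names). The returned dict follows the body format of Qdrant's REST upsert endpoint (`PUT /collections/{collection_name}/points`); the vectors would come from your embedding node.

```python
import uuid


def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of `size` characters, each sharing
    `overlap` characters with the previous chunk."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def build_points(chunks, vectors, filename):
    """Pair each chunk with its vector and metadata as Qdrant points."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    points = []
    for index, (chunk, vector) in enumerate(zip(chunks, vectors)):
        # Deterministic UUID, so re-inserting the same file overwrites
        # its old points instead of adding duplicates
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{filename}#{index}"))
        points.append({
            "id": point_id,
            "vector": list(vector),
            "payload": {  # key names are assumptions, adjust to your setup
                "text": chunk,
                "filename": filename,
                "chunk_index": index,
            },
        })
    return {"points": points}
```

Splitting on fixed character counts can cut words or sentences in half; a text splitter node that respects sentence or paragraph boundaries will usually give cleaner chunks if one is available to you.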