Word files to Qdrant library

Try this and see if it works …

Overall Workflow:

  1. Get File: Fetch documents from Google Drive.
  2. Extract Text: Convert PDF/DOCX to plain text.
  3. Chunk Text: Break text into smaller pieces.
  4. Embed: Turn text chunks into numerical vectors.
  5. Store: Save vectors and original text/metadata in Qdrant.

Key Solutions for Text Extraction:

  • PDFs:
    • Use an n8n Community PDF to Text Node (best).
    • Or, send PDF to an External PDF API via HTTP Request.
  • Word (.docx) Files (Your Main Problem):
    1. Recommended (Self-hosted n8n): Install and use a Community DOCX to Text Node (e.g., n8n-docx-converter). This is the cleanest option.
    2. Google Docs API (n8n Native):
      • Download DOCX from Drive.
      • Use the Google Drive node to Copy File, setting Mime Type to application/vnd.google-apps.document (converts to Google Doc).
      • Download the new Google Doc using the Google Drive node, setting Mime Type to text/plain (exports as text).
      • (Optional) Delete the temporary Google Doc.
    3. Python Script (Advanced, Self-hosted n8n):
      • Download the DOCX binary.
      • Use the Write File node to save it temporarily.
      • Use Execute Command to run a Python script (with the docx2txt library) to convert it to text.
      • Use Read File to load the output text.
    4. External Conversion API: Send the DOCX to an external service (e.g., CloudConvert) via HTTP Request.
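
For the Python Script option, a minimal sketch of the conversion script might look like the one below. It assumes docx2txt is installed on the n8n host (pip install docx2txt). The script name docx_to_text.py and the convert helper are made up for illustration, so adjust them to your setup.

```python
#!/usr/bin/env python3
"""Convert a .docx file to plain text (hypothetical script: docx_to_text.py)."""
import sys
from pathlib import Path


def convert(docx_path, txt_path, extract=None):
    """Extract text from docx_path and write it to txt_path as UTF-8.

    `extract` is any callable that maps a file path to a string. It
    defaults to docx2txt.process, imported lazily so this module can be
    loaded even where docx2txt is not installed.
    """
    if extract is None:
        import docx2txt  # assumption: installed via `pip install docx2txt`
        extract = docx2txt.process
    text = extract(str(docx_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)


# Runs only when called with exactly two arguments:
#   python3 docx_to_text.py INPUT.docx OUTPUT.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    count = convert(sys.argv[1], sys.argv[2])
    print(f"wrote {count} characters to {sys.argv[2]}")
```

In the Execute Command node you would call it with the temporary file written by Write File as the first argument and a path for the text output as the second, then point Read File at that output path.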

After Text Extraction:

  • Chunking: Use a Code Node (or a text splitter node if available) to break the text into manageable chunks (e.g., 500 characters with overlap).
  • Embeddings: Use an Embedding Node (e.g., OpenAI, Cohere, Google Gemini) to convert each text chunk into a vector.
  • Qdrant: Use the Qdrant Vector Store node to “Insert” these vectors, including the original text chunk and metadata (filename, page/section) as payload.
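
To make the chunking and payload steps more concrete, here is a rough Python sketch of the logic. The function names, the 50-character overlap, and the payload keys (text, filename, chunk_index) are assumptions chosen for illustration, not something n8n or Qdrant requires. Depending on your setup, the n8n Code Node may need this ported to JavaScript.

```python
import uuid


def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def build_points(chunks, filename):
    """Pair each chunk with the metadata to store as the Qdrant payload.

    The vector itself is added later, once the embedding step has run.
    IDs are deterministic UUIDs, so re-importing the same file produces
    the same IDs instead of new ones.
    """
    return [
        {
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{filename}#{i}")),
            "payload": {"text": chunk, "filename": filename, "chunk_index": i},
        }
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    sample = "word " * 300  # 1500 characters of dummy text
    points = build_points(chunk_text(sample), "example.docx")
    print(len(points), "chunks ready for embedding")
```

Splitting on fixed character counts is the simplest approach. A text splitter node that respects sentence or paragraph boundaries will usually give cleaner chunks if one is available.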