Built a Multimodal WhatsApp AI Assistant with RAG (n8n + MongoDB Atlas + OpenAI)

Hey n8n community :waving_hand:

Wanted to share something I’ve been working on: a multimodal AI assistant that lives inside WhatsApp. Users can send text, voice notes, images, or documents, and it answers them from a custom knowledge base using RAG (Retrieval-Augmented Generation).

The whole thing is orchestrated in n8n. No custom backend, no glue services, just nodes.

What it does

A user opens WhatsApp and can send literally any of the following:

  • :speech_balloon: Text message → answered directly

  • :studio_microphone: Voice note → transcribed, then answered

  • :framed_picture: Image → described/analyzed, then answered

  • :page_facing_up: Document (PDF, XLSX, XLS, JSON) → content extracted, then answered

Every response is grounded in a private knowledge base stored as vector embeddings in MongoDB Atlas. So instead of generic LLM answers, you get answers about your own documents: product manuals, SOPs, course material, whatever you upload.

Architecture overview

There are two workflows:

1. Knowledge Base Ingestion (the “loader”)

This is the offline pipeline that prepares the knowledge base.

Execute Workflow Trigger
   → Google Docs Importer
   → Document Section Loader
   → Document Chunker
   → OpenAI Embeddings Generator
   → MongoDB Vector Store Inserter
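To make the Section Loader → Chunker idea concrete, here is a minimal Python sketch of section-aware chunking. It is not what the n8n nodes run internally. It assumes the imported text marks headings with `#` (a Google Doc export may look different), and the names `split_sections`, `chunk_sections`, and `max_chars` are made up for illustration.

```python
import re


def split_sections(text: str) -> list[tuple[str, str]]:
    """Split text into (heading, body) pairs on '#'-style headings (assumed format)."""
    sections = []
    heading, lines = "Introduction", []
    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # Close the previous section only if it has real content.
            if any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def chunk_sections(text: str, max_chars: int = 800) -> list[dict]:
    """Pack paragraphs into chunks, never letting a chunk span two sections."""
    chunks = []
    for heading, body in split_sections(text):
        current = ""
        for paragraph in re.split(r"\n\s*\n", body):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            if current and len(current) + len(paragraph) + 2 > max_chars:
                chunks.append({"section": heading, "text": current})
                current = paragraph
            else:
                current = f"{current}\n\n{paragraph}" if current else paragraph
        if current:
            chunks.append({"section": heading, "text": current})
    return chunks
```

Each chunk keeps its section heading as metadata, so a retrieved chunk stays tied to one topic. A single paragraph longer than `max_chars` is kept whole in this sketch rather than split mid-sentence.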

The MongoDB collection has a vector search index configured with:

  • embedding as a knnVector field (cosine similarity)

  • source and doc_id as filterable metadata fields
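For reference, here is a rough Python sketch of what such an index definition can look like. One caveat: `knnVector` is the older Atlas Search field type, and this sketch instead uses the newer `vectorSearch` index format with `vector` and `filter` fields, which is a swap on my part. The 1536 dimensions match the default output size of text-embedding-3-small, and the function name is hypothetical.

```python
import json


def build_vector_index_definition(num_dimensions: int = 1536) -> dict:
    """Return an Atlas `vectorSearch` index definition (assumed newer format)."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",  # field that stores the embedding array
                "numDimensions": num_dimensions,
                "similarity": "cosine",
            },
            # Fields that can be used to pre-filter a vector query.
            {"type": "filter", "path": "source"},
            {"type": "filter", "path": "doc_id"},
        ]
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the Atlas index editor.
    print(json.dumps(build_vector_index_definition(), indent=2))
```

If you change the embedding model or request a reduced dimension count, `numDimensions` has to change with it, otherwise inserts and queries will not match the index.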

This runs whenever I add new source material.

2. WhatsApp Conversational Agent (the “brain”)

This is the real-time workflow that handles incoming messages.

WhatsApp Trigger
   → Route Types (switch based on message type)
       ├── Text   → Map text prompt
       ├── Voice  → Get URL → Download → OpenAI Whisper → transcribe
       ├── Image  → Get URL → Download → OpenAI Vision → describe
       └── Document → Get URL → Download → detect extension
                       ├── PDF  → Extract from PDF
                       ├── XLS  → Extract from XLS
                       ├── XLSX → Extract from XLSX
                       ├── JSON → Map JSON
                       └── else → Send "Unsupported" reply
   → Knowledge Base Agent (AI Agent node)
       ├── OpenAI Chat Model
       ├── Memory (conversation context)
       └── MongoDB Vector Search (with OpenAI Embeddings)
   → Send Response to WhatsApp
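If you are curious what the MongoDB Vector Search step roughly amounts to on the database side, here is a hedged Python sketch that builds a `$vectorSearch` aggregation pipeline. The n8n node handles this for you and may do it differently. The index name `vector_index`, the `text` field, and the function name are assumptions, and the `filter` only works if `source` is indexed as a filter field.

```python
from typing import Optional, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    source: Optional[str] = None,
    limit: int = 4,
    index_name: str = "vector_index",  # hypothetical index name
) -> list[dict]:
    """Build an aggregation pipeline that returns the top `limit` chunks."""
    stage = {
        "index": index_name,
        "path": "embedding",
        "queryVector": list(query_vector),
        # Rule of thumb: consider roughly 20x more candidates than results.
        "numCandidates": limit * 20,
        "limit": limit,
    }
    if source is not None:
        # Optional pre-filter on one of the metadata fields.
        stage["filter"] = {"source": {"$eq": source}}
    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "_id": 0,
                "text": 1,  # assumed name of the chunk text field
                "source": 1,
                "doc_id": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The query vector has to come from the same embedding model used at ingestion time, otherwise the similarity scores are meaningless.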

The clever bit is the Route Types switch right after the trigger. WhatsApp’s webhook payload tells you whether the inbound message is text, audio, image, or document, and each path then normalizes the content into a plain text prompt before handing off to the AI Agent. By the time the agent sees the message, it doesn’t matter whether the user typed it, said it, drew it, or uploaded it as a PDF. It’s all just text.
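Here is a small Python sketch of that routing logic, in case it helps to see it outside the n8n canvas. It assumes the message object looks like the WhatsApp Cloud API format (a `type` field plus a `document.filename` for documents). The branch names and `route_message` are made up, and sending unknown message types to the unsupported branch is my assumption.

```python
from pathlib import PurePosixPath

# Document extensions that have an extraction branch in the workflow.
SUPPORTED_EXTENSIONS = {".pdf": "pdf", ".xls": "xls", ".xlsx": "xlsx", ".json": "json"}


def route_message(message: dict) -> str:
    """Return the name of the branch that should handle this message."""
    msg_type = message.get("type")
    if msg_type == "text":
        return "text"
    if msg_type == "audio":
        return "voice"
    if msg_type == "image":
        return "image"
    if msg_type == "document":
        filename = message.get("document", {}).get("filename", "")
        extension = PurePosixPath(filename).suffix.lower()
        # Unknown extensions fall through to the "Unsupported" reply.
        return SUPPORTED_EXTENSIONS.get(extension, "unsupported")
    return "unsupported"
```

For example, a document named `Report.PDF` routes to the `pdf` branch because the extension check is case-insensitive, while `photos.zip` ends up on the unsupported branch.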

Tech stack

| Layer | Tool |
| --- | --- |
| Orchestration | n8n |
| Messaging | WhatsApp Business API |
| LLM | OpenAI (GPT for chat, Whisper for voice, Vision for images) |
| Embeddings | OpenAI text-embedding-3-small |
| Vector DB | MongoDB Atlas Vector Search |
| Memory | n8n’s built-in chat memory |

Things I learned along the way

A few things that tripped me up that might save someone else time:

  • WhatsApp media URLs expire fast (the Cloud API docs give them about five minutes). You have to download the media inside the same execution, before the URL goes stale. Don’t try to be clever and pass URLs around.

  • Chunking strategy matters more than you think. I started with naive fixed-size chunks and got noisy retrieval. Switching to section-based chunking (using the Document Section Loader before the Chunker) dramatically improved answer quality.

  • Always have a fallback path. The “Send Unsupported Response” branch for weird file types keeps the bot from silently failing when someone sends a .heic or a .zip.

  • Memory is worth the extra node. Without it, every message feels cold. With it, the bot can handle follow-up questions like “and what about the second point?” properly.
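On the media URL point, the Get URL → Download pair can be sketched in Python like this. It assumes the WhatsApp Cloud API flow, where you look up the media ID to get a short-lived URL and then fetch that URL with the same bearer token. The Graph API version in `GRAPH_BASE` and the helper names are placeholders, so adjust them to your setup.

```python
import json
import urllib.request
from typing import Callable

GRAPH_BASE = "https://graph.facebook.com/v19.0"  # hypothetical API version


def http_get(url: str, headers: dict) -> bytes:
    """Plain HTTP GET that returns the raw response body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_media(
    media_id: str,
    token: str,
    get: Callable[[str, dict], bytes] = http_get,
) -> bytes:
    """Resolve the media URL and download it right away, in one call."""
    headers = {"Authorization": f"Bearer {token}"}
    # Step 1: look up the short-lived download URL for this media ID.
    metadata = json.loads(get(f"{GRAPH_BASE}/{media_id}", headers))
    # Step 2: fetch the bytes immediately, before the URL expires.
    return get(metadata["url"], headers)
```

Doing both steps in one function means nothing downstream ever holds the URL, only the bytes. The `get` parameter is there so the function can be tested with a fake HTTP client.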

What’s next

Things I’m planning to add:

  • Per-user knowledge bases (right now everyone hits the same index)

  • Image generation responses (not just image understanding)

  • A simple admin command (/reindex) to trigger the ingestion workflow from WhatsApp itself


Happy to share more details on any specific part of the build, so just drop a comment. If you’re building something similar and get stuck on the WhatsApp media handling or the MongoDB vector index setup, I’m happy to help.

Cheers :rocket:
