Hey n8n community!
Wanted to share something I've been working on: a multimodal AI assistant that lives inside WhatsApp. Users can send text, voice notes, images, or documents, and it answers them from a custom knowledge base using RAG (Retrieval-Augmented Generation).
The whole thing is orchestrated in n8n. No custom backend, no glue services, just nodes.
What it does
A user opens WhatsApp and can send literally any of the following:
- Text message → answered directly
- Voice note → transcribed, then answered
- Image → described/analyzed, then answered
- Document (PDF, XLSX, XLS, JSON) → content extracted, then answered
Every response is grounded in a private knowledge base stored as vector embeddings in MongoDB Atlas. So instead of generic LLM answers, you get answers about your own documents: product manuals, SOPs, course material, whatever you upload.
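If the "grounded in vector embeddings" part sounds abstract, here is a toy Python sketch of the retrieval idea. It is not what runs in the workflow (MongoDB Atlas and the AI Agent node handle that); the three-dimensional vectors, the sample chunks, and the function names are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """Return the text of the k chunks whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy 3-dimensional "embeddings"; real embedding vectors are far longer.
CHUNKS = [
    {"text": "Reset the router by holding the button for 10 seconds.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Refunds are processed within 5 business days.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "The router LED blinks red when offline.", "embedding": [0.8, 0.3, 0.1]},
]

def build_prompt(question, query_vec, chunks=CHUNKS):
    """Put the retrieved chunks in front of the question before calling the LLM."""
    context = "\n".join(top_k(query_vec, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The point is only that the LLM sees the closest chunks plus the question, so the answer comes from your material rather than from the model's general knowledge.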
Architecture overview
There are two workflows:
1. Knowledge Base Ingestion (the "loader")
This is the offline pipeline that prepares the knowledge base.
Execute Workflow Trigger
→ Google Docs Importer
→ Document Section Loader
→ Document Chunker
→ OpenAI Embeddings Generator
→ MongoDB Vector Store Inserter
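To make the Section Loader → Chunker step concrete, here is a rough Python sketch of section-aware chunking. It only illustrates the idea: the n8n nodes do the real work, and the `#` heading convention, the 500-character limit, and the function names are assumptions on my part.

```python
def split_into_sections(text):
    """Split a document on heading lines (assumed here to start with '#')."""
    sections = []
    current_title, current_lines = "Untitled", []
    for line in text.splitlines():
        if line.startswith("#"):
            # Close the previous section, but skip it if it has no content.
            if any(l.strip() for l in current_lines):
                sections.append((current_title, "\n".join(current_lines).strip()))
            current_title, current_lines = line.lstrip("#").strip(), []
        else:
            current_lines.append(line)
    if any(l.strip() for l in current_lines):
        sections.append((current_title, "\n".join(current_lines).strip()))
    return sections

def chunk_sections(text, max_chars=500):
    """Chunk each section on its own so no chunk straddles two sections."""
    chunks = []
    for title, body in split_into_sections(text):
        for start in range(0, len(body), max_chars):
            chunks.append({"section": title, "text": body[start:start + max_chars]})
    return chunks
```

Each chunk keeps its section title, so a retrieved chunk never mixes text from two unrelated parts of a document.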
The MongoDB collection has a vector search index configured with:
- `embedding` as a `knnVector` field (cosine similarity)
- `source` and `doc_id` as filterable metadata fields
This runs whenever I add new source material.
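For anyone setting up a similar index, here is a sketch of a comparable definition in Atlas's `vectorSearch` index format, which uses a `vector` field type rather than the `knnVector` mapping mentioned above. Treat it as an illustration, not the exact definition from this build: the 1536 dimensions assume text-embedding-3-small at its default output size, and the helper name is hypothetical.

```python
import json

def build_vector_index_definition(vector_path="embedding",
                                  num_dimensions=1536,
                                  filter_paths=("source", "doc_id")):
    """Build an Atlas Vector Search index definition as a plain dict."""
    fields = [{
        "type": "vector",
        "path": vector_path,
        "numDimensions": num_dimensions,  # must match the embedding model's output size
        "similarity": "cosine",
    }]
    # Filter fields let a query restrict results by metadata before ranking.
    fields += [{"type": "filter", "path": path} for path in filter_paths]
    return {"fields": fields}

if __name__ == "__main__":
    # Paste the printed JSON into the Atlas index editor, or adapt as needed.
    print(json.dumps(build_vector_index_definition(), indent=2))
```

If the dimension count does not match what the embedding model returns, inserts and queries will not line up with the index, so that number is the one to double-check.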
2. WhatsApp Conversational Agent (the "brain")
This is the real-time workflow that handles incoming messages.
WhatsApp Trigger
→ Route Types (switch based on message type)
├── Text → Map text prompt
├── Voice → Get URL → Download → OpenAI Whisper → transcribe
├── Image → Get URL → Download → OpenAI Vision → describe
└── Document → Get URL → Download → detect extension
│   ├── PDF → Extract from PDF
│   ├── XLS → Extract from XLS
│   ├── XLSX → Extract from XLSX
│   ├── JSON → Map JSON
│   └── else → Send "Unsupported" reply
→ Knowledge Base Agent (AI Agent node)
├── OpenAI Chat Model
├── Memory (conversation context)
└── MongoDB Vector Search (with OpenAI Embeddings)
→ Send Response to WhatsApp
The clever bit is the Route Types switch right after the trigger. WhatsApp's webhook payload tells you whether the inbound message is text, audio, image, or document, and each path then normalizes the content into a plain text prompt before handing off to the AI Agent. By the time the agent sees the message, it doesn't matter whether the user typed it, said it, drew it, or uploaded it as a PDF: it's all just text.
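In code terms, the switch boils down to something like the Python sketch below. In the workflow this is a Switch node, not a script; the function names and the returned dict shape are made up here, while the payload path follows the WhatsApp Cloud API webhook format, where a message carries a `type` field and a matching object such as `text`, `audio`, `image`, or `document`.

```python
def extract_message(payload):
    """Pull the first inbound message out of a WhatsApp webhook body."""
    try:
        return payload["entry"][0]["changes"][0]["value"]["messages"][0]
    except (KeyError, IndexError, TypeError):
        return None  # e.g. a delivery-status update with no "messages" array

def route_message(payload):
    """Decide which branch should handle the message."""
    message = extract_message(payload)
    if message is None:
        return {"route": "ignore"}
    msg_type = message.get("type")
    if msg_type == "text":
        return {"route": "text", "prompt": message["text"]["body"]}
    if msg_type in ("audio", "image", "document"):
        media = message[msg_type]
        # Media branches still need a download step before they become text.
        return {"route": msg_type, "media_id": media["id"],
                "filename": media.get("filename")}
    return {"route": "unsupported"}
```

Voice notes arrive with type `audio`, which is why the Voice branch keys off that value.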
Tech stack
| Layer | Tool |
|---|---|
| Orchestration | n8n |
| Messaging | WhatsApp Business API |
| LLM | OpenAI (GPT for chat, Whisper for voice, Vision for images) |
| Embeddings | OpenAI text-embedding-3-small |
| Vector DB | MongoDB Atlas Vector Search |
| Memory | n8n's built-in chat memory |
Things I learned along the way
A few things that tripped me up that might save someone else time:
- WhatsApp media URLs expire fast. You have to download the media inside the same execution, before the URL goes stale. Don't try to be clever and pass URLs around.
- Chunking strategy matters more than you think. I started with naive fixed-size chunks and got noisy retrieval. Switching to section-based chunking (using the Document Section Loader before the Chunker) dramatically improved answer quality.
- Always have a fallback path. The "Send Unsupported Response" branch for weird file types keeps the bot from silently failing when someone sends a `.heic` or a `.zip`.
- Memory is worth the extra node. Without it, every message feels cold. With it, the bot can handle follow-up questions like "and what about the second point?" properly.
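The fallback idea from the list above is easy to show in a few lines of Python. This is just a sketch of the pattern, with hypothetical branch names; in the workflow it is a switch on the file extension with an "else" output.

```python
from pathlib import PurePosixPath

# Map each supported extension to the branch that handles it (names are illustrative).
EXTRACTORS = {
    ".pdf": "extract_pdf",
    ".xls": "extract_xls",
    ".xlsx": "extract_xlsx",
    ".json": "map_json",
}

def pick_branch(filename):
    """Return the extractor branch for a file, or the unsupported-reply branch."""
    # Lowercase the suffix so "Report.PDF" and "report.pdf" take the same path.
    ext = PurePosixPath(filename or "").suffix.lower()
    return EXTRACTORS.get(ext, "send_unsupported_reply")
```

The default in `EXTRACTORS.get` is the whole trick: anything not explicitly supported gets an explicit reply, so the user is never left waiting on a message that quietly went nowhere.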
What's next
Things I'm planning to add:
- Per-user knowledge bases (right now everyone hits the same index)
- Image generation responses (not just image understanding)
- A simple admin command (`/reindex`) to trigger the ingestion workflow from WhatsApp itself
Happy to share more details on any specific part of the build, just drop a comment. If you're building something similar and get stuck on the WhatsApp media handling or MongoDB vector index setup, I'm happy to help.
Cheers
