Handling WhatsApp Voice Notes in n8n

I built a simple n8n workflow to handle both text messages and voice notes coming from WhatsApp.

The reason was straightforward: most WhatsApp automations work fine with text, but real users often send voice notes, incomplete messages, or mixed-language input. That usually breaks the flow.

So I designed the workflow to first check the message type.

If it’s a text message, it goes directly for processing.
If it’s a voice note, the workflow fetches the audio file, transcribes it, and then processes the transcript the same way as text.

That keeps the logic clean and makes the system much more practical for real-world use.
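The type check described above can be sketched roughly as follows. This is a minimal sketch, not the actual workflow: it assumes the raw WhatsApp Cloud API webhook shape (entry → changes → value → messages), and the function name `extract_message` and the output keys (`kind`, `sender`, `media_id`) are made up for illustration. The n8n WhatsApp Trigger node may hand you a slightly different structure.

```python
from typing import Any, Optional


def extract_message(payload: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the first message in a webhook payload in a small routing-friendly shape.

    Returns None when the payload carries no message (e.g. delivery status updates).
    """
    try:
        message = payload["entry"][0]["changes"][0]["value"]["messages"][0]
    except (KeyError, IndexError, TypeError):
        return None

    sender = message.get("from")
    msg_type = message.get("type")

    if msg_type == "text":
        # Text goes straight to processing.
        body = message.get("text", {}).get("body", "")
        return {"kind": "text", "sender": sender, "text": body}

    if msg_type == "audio":
        # Voice notes only carry a media ID; the audio itself is fetched separately.
        media_id = message.get("audio", {}).get("id")
        return {"kind": "voice", "sender": sender, "media_id": media_id}

    # Anything else (stickers, locations, ...) is flagged instead of silently dropped.
    return {"kind": "unsupported", "sender": sender, "type": msg_type}
```

In n8n the same branching is usually a Switch or IF node on the message type; the point of the sketch is only that both branches end in the same shape.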

What I like about this setup is that it adapts to how people naturally communicate instead of forcing them into a strict format.

This kind of workflow can be useful for:

  • ordering systems
  • support flows
  • appointment booking
  • procurement requests

For me, the main value here is not just sending an AI reply.
It’s turning messy input like voice notes into something structured enough to actually use inside a workflow.

Would love to see how others are handling audio-based WhatsApp use cases in n8n.

Good pattern. A few things that come up once voice notes are in production:

Transcription fallback: Short or noisy recordings often come back as gibberish or fail entirely. Worth wrapping the transcription step with error handling — if the result is under ~5 words or empty, send back “Sorry, I couldn’t understand that audio — could you type your message?” rather than passing garbage to the AI.
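A minimal sketch of that guard, assuming the transcript arrives as a plain string (or None when the transcription step failed). The names `check_transcript`, `MIN_WORDS`, and `FALLBACK_REPLY` are hypothetical, and the 5-word threshold is just the rough figure from the paragraph above.

```python
from typing import Optional

MIN_WORDS = 5  # rough threshold from the discussion; tune for your own traffic
FALLBACK_REPLY = "Sorry, I couldn't understand that audio. Could you type your message?"


def check_transcript(transcript: Optional[str], min_words: int = MIN_WORDS) -> dict:
    """Decide whether a transcript is usable or should trigger the fallback reply."""
    text = (transcript or "").strip()
    if len(text.split()) < min_words:
        # Empty, failed, or very short result: ask the user to type instead.
        return {"ok": False, "reply": FALLBACK_REPLY}
    return {"ok": True, "content": text}
```

One trade-off to keep in mind: a pure word-count check also rejects legitimate short answers such as "yes, confirmed", so a lower threshold may make sense for flows that expect brief replies.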

Media download timing: WhatsApp Cloud API media URLs are short-lived and expire after a few minutes (about 5, per the documentation). If your workflow has any delay between receiving the webhook and fetching the audio, you can hit a 401. Fetch the media binary immediately in the first node after the trigger, before any buffering or dedup logic. If a URL has already expired, query the same media ID again to get a fresh one.
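For reference, the fetch is a two-step call: resolve the media ID to a URL, then download that URL with the same bearer token. The sketch below assumes the Graph API version shown in `GRAPH_BASE` (swap in whichever version your app uses); `download_media`, `http_get`, and the injectable `get` parameter are hypothetical names added so the logic can be tried without network access. In n8n this would normally be two HTTP Request nodes placed directly after the trigger.

```python
import json
import urllib.request
from typing import Callable

# Assumption: adjust the version segment to match your app.
GRAPH_BASE = "https://graph.facebook.com/v19.0"

HttpGet = Callable[[str, dict], bytes]


def http_get(url: str, headers: dict) -> bytes:
    """Plain GET using only the standard library."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_media(media_id: str, token: str, get: HttpGet = http_get) -> bytes:
    """Resolve a media ID to its short-lived URL and download it right away."""
    headers = {"Authorization": f"Bearer {token}"}
    # Step 1: the media ID lookup returns JSON that includes a temporary "url".
    metadata = json.loads(get(f"{GRAPH_BASE}/{media_id}", headers))
    # Step 2: download immediately; the URL also requires the bearer token.
    return get(metadata["url"], headers)
```

Doing both steps in one function keeps no gap between resolving the URL and using it, which is the failure mode described above.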

Multilingual: As Benjamin mentioned, Whisper auto-detects the language by default, but you can nudge it with the language parameter if you know your user base (e.g., pt for Portuguese-speaking users in Brazil). That cuts errors significantly for non-English voices.
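As a rough illustration, here is how the language hint could be added to the form fields sent to a Whisper-style transcription endpoint. This assumes the OpenAI transcription API, which takes `model` and an optional `language` field as an ISO-639-1 code; the helper name `transcription_fields` is hypothetical, and the audio file itself would be attached separately as the multipart `file` field.

```python
from typing import Optional


def transcription_fields(language: Optional[str] = None, model: str = "whisper-1") -> dict[str, str]:
    """Build the non-file form fields for a transcription request.

    With no language given, the field is omitted and the model auto-detects.
    """
    fields = {"model": model}
    if language:
        code = language.strip().lower()
        # ISO-639-1 codes are exactly two ASCII letters, e.g. "pt", "en", "es".
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"expected a two-letter ISO-639-1 code, got {language!r}")
        fields["language"] = code
    return fields
```

In an n8n HTTP Request node, these would simply be extra form-data fields next to the binary audio.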

Unified normalization: what you described — normalizing all types to a single content field at entry — is the right call. We run the same pattern in production: text gets passed through directly, voice gets transcribed into text, images get captioned. Same AI agent handles everything downstream.
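A sketch of what that single entry point might look like, assuming each incoming message has already been reduced to a dict with a `kind` plus either `text` or raw `data` bytes. That shape, the `normalize` function, and the injected `transcribe` and `caption` callables are all hypothetical stand-ins for whatever transcription and captioning steps you actually run.

```python
from typing import Callable


def normalize(
    message: dict,
    transcribe: Callable[[bytes], str],
    caption: Callable[[bytes], str],
) -> dict:
    """Collapse text, voice, and image messages into one `content` field."""
    kind = message["kind"]
    if kind == "text":
        content = message["text"]  # passed through unchanged
    elif kind == "voice":
        content = transcribe(message["data"])  # audio bytes -> transcript
    elif kind == "image":
        content = caption(message["data"])  # image bytes -> caption
    else:
        raise ValueError(f"unsupported message kind: {kind}")
    # Downstream nodes only ever read `content`; `source` is kept for logging.
    return {"source": kind, "content": content.strip()}
```

Because the AI agent only sees `content`, adding a new input type later means adding one branch here rather than touching anything downstream.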

Good points — I agree.

I’ll update the flow to fetch WhatsApp media immediately after the webhook, add fallback handling for short/failed transcriptions, and avoid sending bad audio text into the AI.

I’ll also keep the unified normalization pattern so text, voice, and images all become one clean content field before the AI agent.