Describe the problem/error/question
I’m building a multimodal AI assistant in n8n that receives fragmented messages from WhatsApp (text, audio, image, PDF), and I need to accumulate those messages using Redis so that they can be processed together after the user finishes sending.
The goal is to simulate a smart buffer, where each input is temporarily stored and, after a short delay (like 6s), all parts are retrieved, parsed, merged, and sent to OpenAI GPT-4o for response generation.
I’m using the Redis node (via Upstash) to `Push` each message (as `JSON.stringify($json)`) and then `Get` them back to reconstruct the context.
There is no error message, but I can’t come up with a solution: the stored JSONs end up not matching in the IF node.
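To make it concrete, the Code node I run after the `Get` looks roughly like this (a simplified sketch; `buffer`, `text`, `transcription`, and `caption` are placeholder property names, not necessarily what your nodes output):

```javascript
// Code node ("Run Once for All Items") placed right after the Redis "Get"
// (key type: list). Assumption: the Get node writes the list into a
// property called "buffer"; adjust the names to your own workflow.
const rawItems = $json.buffer ?? [];

// Every entry was stored with JSON.stringify($json), so parse each one back.
const parts = rawItems.map((entry) => {
  try {
    return JSON.parse(entry);
  } catch (error) {
    // Fallback for anything that isn't valid JSON (e.g. plain text pushed elsewhere)
    return { text: String(entry) };
  }
});

// Merge the pieces into one text block for the GPT-4o prompt.
const merged = parts
  .map((p) => p.text ?? p.transcription ?? p.caption ?? '')
  .filter(Boolean)
  .join('\n');

return [{ json: { merged, parts } }];
```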
I want to create a humanized AI agent for a medical clinic.
One of the most important things I’ve realized while developing a conversational AI system is that people rarely send everything they want to say in a single message.
Especially in messaging apps like WhatsApp, it’s common for users to break up their thoughts — they might start with a short text, follow up with a voice note, then add an image or even a PDF. This is natural human behavior. Yet most automation flows or AI integrations are designed to respond immediately to each incoming message, without waiting or checking if there’s more coming.
That leads to shallow responses, lack of context, and a poor user experience.
To solve this, I implemented a buffering mechanism that listens across different types of media. Whenever the user sends something — whether it’s a text, audio, image, or document — it gets processed into plain text and temporarily stored. A short wait period (like 6 seconds) gives the user time to finish expressing themselves. Only after that, all the accumulated pieces are retrieved and analyzed together as a single context.
This allows the AI to fully understand the intention behind the conversation, rather than responding prematurely based on incomplete input.
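In n8n terms, what I’m trying to build is: `Push` the normalized message to a per-chat Redis list, `Wait` about 6 seconds, `Get` the list again, and let only the execution that pushed the most recent message continue to the merge and the GPT-4o call. A sketch of that check, assuming each buffered entry carries a `messageId` and the incoming message is still available from a node I’ll call "Normalize Message" (both names are placeholders):

```javascript
// Code node after Wait + Redis "Get": decide whether this execution handled
// the most recent buffered message, so only that execution proceeds.
// Assumptions: the Get node outputs the list as "buffer"; the triggering
// message comes from a node named "Normalize Message" (placeholder names).
const buffer = $json.buffer ?? [];
const lastEntry = buffer.length > 0 ? JSON.parse(buffer[buffer.length - 1]) : null;

const current = $('Normalize Message').first().json;

// Compare a stable identifier rather than the whole stringified JSON.
const isLatest = lastEntry !== null && lastEntry.messageId === current.messageId;

return [{ json: { isLatest } }];
```

An IF node then checks `{{ $json.isLatest }}`, and only the true branch goes on to retrieve, parse, and merge everything.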
It also handles mixed media seamlessly. For example, the user can send a sequence like:
- A short text saying they want an appointment
- A voice note with the preferred time
- A photo of an exam result
- A PDF with their medical history
Each of those is individually processed — voice is transcribed, images are interpreted using vision models, documents are parsed — and then everything is merged into one unified input for the AI to reason over.
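In the workflow, each media branch ends with a small Code node that normalizes its result to the same shape before the Redis `Push`, roughly like this (a sketch; `transcription`, `caption`, and `documentText` are placeholders for whatever your transcription, vision, and extraction nodes actually return):

```javascript
// Normalization sketch: whatever the media type, reduce it to
// { chatId, messageId, type, text } before pushing JSON.stringify($json)
// to the per-chat Redis list. All field names are placeholders.
const input = $json;

let type = 'text';
let text = input.body ?? '';          // plain WhatsApp text message

if (input.transcription) {            // output of the audio-transcription branch
  type = 'audio';
  text = input.transcription;
} else if (input.caption) {           // output of the image / vision branch
  type = 'image';
  text = input.caption;
} else if (input.documentText) {      // output of the PDF-extraction branch
  type = 'document';
  text = input.documentText;
}

return [{
  json: {
    chatId: input.chatId,
    messageId: input.messageId,
    type,
    text,
  },
}];
```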
This gives the assistant the ability to respond with depth, clarity, and context-awareness. It doesn’t just reply to a message. It responds to the whole situation.
And that, I believe, is what separates a real AI experience from a chatbot that just reacts.
If anyone can help me, I will be very grateful.