Hi everyone,
I’m trying to build a pipeline in n8n to extract structured business information from an audio call recording using an AI Agent.
Goal
Given an Audio URL, I want an AI agent to analyze the conversation and extract a fixed set of predefined data points into a strict JSON schema.
What I’ve tried so far

- HTTP Request node
  - Used to download the audio file from the URL
  - Response format set to File
  - Binary data is generated correctly (I can play/download the audio from the Binary tab)
- AI Agent node
  - Configured to use Gemini 3 Flash Preview via an LLM proxy (LiteLLM)
  - Passed:
    - a detailed system prompt
    - a detailed user prompt
    - a strict output JSON schema
  - Enabled “Automatically parse binary files”
  - Tried passing:
    - only the Audio URL
    - only the audio binary
    - both the Audio URL and binary together
- Transcription nodes
  - I have already tried:
    - the Gemini Transcribe node
    - the OpenAI Transcribe node
  - These are not suitable for my use case because:
    - I do not just need a transcript; I need specific structured data points extracted directly from the call
    - The Gemini Transcribe node requires Google PaLM API credentials, while I am accessing Gemini models only through a LiteLLM proxy, not directly via Google APIs
What’s going wrong

- The AI Agent:
  - either returns empty JSON
  - or returns random / hallucinated values
- The behavior suggests that:
  - the audio is not actually being processed
  - or the model is ignoring the audio input entirely
- This happens even though:
  - the audio binary exists and is valid
  - the MIME types are correct (e.g. audio/mpeg)
  - Gemini 3 Flash Preview claims to support audio inputs
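As a sanity check outside n8n, one thing worth ruling out is that the “valid” binary is actually an error page served with an audio/mpeg header. A minimal standalone sketch (not an n8n node) that inspects the first bytes: MP3 files begin either with an ID3v2 tag or directly with an MPEG frame-sync pattern.

```python
def looks_like_mp3(data: bytes) -> bool:
    """Heuristic check: MP3 files start with an ID3v2 tag ("ID3")
    or with an MPEG frame sync (0xFF, then the top 3 bits of the next byte set)."""
    if data[:3] == b"ID3":
        return True
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0

# An HTML error page masquerading as audio fails the check:
print(looks_like_mp3(b"ID3\x04\x00\x00"))         # True
print(looks_like_mp3(b"\xff\xfb\x90\x00"))        # True
print(looks_like_mp3(b"<html><body>403</body>"))  # False
```

If the downloaded bytes fail this check, the problem is upstream of the model entirely.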
What I suspect
One or more of the following may be true:

- The AI Agent node does not truly support audio input, even when binary parsing is enabled
- Gemini audio support may require direct Gemini API calls rather than an AI Agent abstraction
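To isolate whether the model/proxy path handles audio at all, I was considering bypassing the AI Agent node and sending a single OpenAI-format chat request straight at the LiteLLM endpoint. Below is a sketch of what I think that payload would look like; the model name is a placeholder, and the assumption that my LiteLLM deployment translates the OpenAI `input_audio` content part into Gemini inline audio is exactly what I’d like confirmed, not something I’ve verified.

```python
import base64
import json

def build_audio_extraction_request(audio_bytes: bytes, system_prompt: str,
                                   user_prompt: str, model: str) -> dict:
    """Build an OpenAI-format chat request with an inline base64 audio part.
    Whether LiteLLM maps `input_audio` to Gemini inline audio is an assumption
    to check against the LiteLLM version/docs in use."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_prompt},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(audio_bytes).decode("ascii"),
                     "format": "mp3",  # matches the audio/mpeg binary
                 }},
            ]},
        ],
        # Ask for JSON-only output; a stricter json_schema response_format
        # could be used if the proxy/model supports it.
        "response_format": {"type": "json_object"},
    }

# The resulting dict would be POSTed to <LITELLM_BASE_URL>/v1/chat/completions
# with the proxy API key; placeholder values shown here:
payload = build_audio_extraction_request(
    b"\xff\xfb\x90\x00", "Extract the predefined data points as strict JSON.",
    "Analyze this call recording.", "gemini-flash-placeholder")
print(json.dumps(payload)[:60])
```

If this raw request works but the AI Agent node doesn’t, that would confirm the limitation is in the node’s audio handling rather than in the model or proxy.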
What I’m looking for help with

- Has anyone successfully used:
  - the AI Agent with audio input?
  - Gemini via LiteLLM for multimodal audio processing?
- Is audio → AI Agent currently unsupported in n8n?
- Is STT → extraction the recommended approach?
- If single-step audio analysis is possible, what is the correct node / configuration?
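Whichever route ends up working (single-step audio analysis or STT → extraction), I plan to guard against the empty/hallucinated-JSON failure mode described above instead of passing bad data downstream. A minimal sketch of that validation step; the field names here are made-up placeholders, not my actual schema:

```python
import json

REQUIRED_FIELDS = {"caller_name", "company", "call_outcome"}  # placeholder field names

def parse_extraction(raw: str) -> dict:
    """Parse the model's reply; fail loudly if it is empty, not a JSON object,
    or missing required fields."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not data:
        raise ValueError(f"Model returned empty or non-object JSON: {raw!r}")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return data

result = parse_extraction(
    '{"caller_name": "A", "company": "B", "call_outcome": "follow-up"}')
print(result["call_outcome"])  # follow-up
```

In n8n this logic would live in a Code node (in JavaScript), but the shape of the check is the same.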
Any confirmation, working examples, or architectural guidance would be greatly appreciated.
Thanks in advance!