AI Agent unable to extract structured data from audio URL / binary (Gemini 3 Flash via LLM Proxy)

Hi everyone,

I’m trying to build a pipeline in n8n to extract structured business information from an audio call recording using an AI Agent.


Goal

Given an Audio URL, I want an AI agent to analyze the conversation and extract a fixed set of predefined data points into a strict JSON schema.


What I’ve tried so far

  1. HTTP Request node

    • Used to download the audio file from the URL

    • Response format set to File

    • Binary data is generated correctly (I can play/download the audio from the Binary tab)

  2. AI Agent node

    • Configured to use Gemini 3 Flash Preview via an LLM proxy (LiteLLM)

    • Passed:

      • A detailed system prompt

      • A detailed user prompt

      • A strict output JSON schema

    • Enabled “Automatically parse binary files”

    • Tried passing:

      • Only the Audio URL

      • Only the audio binary

      • Both the Audio URL and binary together

  3. Transcription nodes

    • I have already tried:

      • Gemini Transcribe node

      • OpenAI Transcribe node

    • These are not suitable for my use case because:

      • I do not just need a transcript

      • I need specific structured data points extracted directly from the call

      • The Gemini Transcribe node requires Google PaLM API credentials, while I am accessing Gemini models only through a LiteLLM proxy, not directly via Google APIs


What’s going wrong

  • The AI Agent:

    • Either returns empty JSON

    • Or returns random / hallucinated values

  • The behavior suggests that:

    • The audio is not actually being processed

    • Or the model is ignoring the audio input entirely

This happens even though:

  • The audio binary exists and is valid

  • MIME types are correct (e.g. audio/mpeg)

  • Gemini 3 Flash Preview claims to support audio inputs


What I suspect

One or more of the following may be true:

  • The AI Agent node does not truly support audio input, even when binary parsing is enabled

  • Gemini audio support may require direct Gemini API calls rather than an AI Agent abstraction


What I’m looking for help with

  • Has anyone successfully used:

    • AI Agent with audio input?

    • Gemini via LiteLLM for multimodal audio processing?

  • Is audio → AI Agent currently unsupported in n8n?

  • Is STT → extraction the recommended approach?

  • If single-step audio analysis is possible, what is the correct node / configuration?

Any confirmation, working examples, or architectural guidance would be greatly appreciated.

Thanks in advance!

hi, Kumar!

The AI agent node doesn’t support audio as input. Audio binaries are ignored semantically, wich is why the output is empty or hallucinated.

Correctly aproach: audio > transcription > Ai agent

Documentation: advanced-ai/ai-agent/ and workflows/data/binary-data/

If this helps, please consider marking this reply as the solution.

Hi @tamy.santos

Thanks for your response.

As i mentioned, Gemini Transcribe Node requires Gemini API through Google PaLM Account. Similarly, the OpenAI Transcription Node requires the OpenAI API.

I am utilising an in-house LiteLLM proxy server to access the models in my organisation. Even though the required models are available through the LiteLLM proxy, i am unable to integrate them in the aforementioned nodes.

Kindly suggest any workaround if available.

Thanks in advance!

One possible is to handle speech-to-text outside the built-in transcription nodes, for example by calling your LiteLLM STT endpoint via a custom HTTP Request, and then passing the resulting text to the AI Agent for structured extraction.
Let me know if this works, I’d be happy to know it.

1 Like

Hi @tamy.santos

Tried the same. Works with /audio/transcriptions endpoint of LiteLLM for STT. Used the transcriptions for downstream AI Agent.

Thanks a lot for the help.

1 Like

AI agent node doesn’t support audio as input. You should convert voice to text using openAI or with other transcribing tools

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.