How Can I Build an n8n Workflow to Extract Audio Metadata and Transcribe Files Using Whisper?

I’m working on building an n8n workflow that begins with a form trigger node where users can upload either an audio or video file. The goal is to:

Detect the type of uploaded file (audio or video).

If it’s a video, convert it to audio.

Extract the audio duration and metadata.

Send the audio file to OpenAI’s Whisper API (or an equivalent speech-to-text service) to generate a transcription.

Can you guide me through designing this complete workflow, including tools or services for conversion, metadata extraction, and the Whisper API request integration?

Hey,

A nice example of a similar workflow is here: https://n8n.io/workflows/3586-ai-powered-whatsapp-chatbot-for-text-voice-images-and-pdfs-with-memory/

It can be easily adapted to do what you want.
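To make the four steps from the question concrete, here is a rough sketch of the same logic as a standalone Python script. It is only an illustration, not how the linked template does it. Assumptions: `ffmpeg` and `ffprobe` are installed on the machine, the `requests` package is available, and the helper names (`classify_upload`, `process_upload`, and so on) are made up for this example. In n8n the same steps would probably map to an If or Switch node on the binary `mimeType`, an Execute Command node on a self-hosted instance, and an HTTP Request node for the transcription call.

```python
import json
import mimetypes
import subprocess
from pathlib import Path
from typing import Optional


def classify_upload(filename: str, mime_type: Optional[str] = None) -> str:
    """Return 'audio', 'video', or 'other' from the MIME type (extension as fallback)."""
    mime = mime_type or mimetypes.guess_type(filename)[0] or ""
    top_level = mime.split("/", 1)[0]
    return top_level if top_level in ("audio", "video") else "other"


def ffmpeg_extract_audio_cmd(src: str, dst: str) -> list:
    # -vn drops the video stream; mono 16 kHz keeps the file small for speech-to-text.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]


def ffprobe_metadata_cmd(src: str) -> list:
    # Ask ffprobe for container and stream info as JSON on stdout.
    return ["ffprobe", "-v", "error", "-show_format", "-show_streams", "-of", "json", src]


def parse_ffprobe(output: str) -> dict:
    """Pick duration, container name, and tags out of ffprobe's JSON output."""
    fmt = json.loads(output).get("format", {})
    return {
        "duration_seconds": float(fmt.get("duration", 0.0)),
        "format_name": fmt.get("format_name"),
        "tags": fmt.get("tags", {}),
    }


def transcribe(audio_path: str, api_key: str) -> str:
    """Send the audio file to OpenAI's transcription endpoint and return the text."""
    import requests  # third-party package, assumed to be installed

    with open(audio_path, "rb") as fh:
        resp = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (Path(audio_path).name, fh)},
            data={"model": "whisper-1"},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()["text"]


def process_upload(path: str, api_key: str, mime_type: Optional[str] = None) -> dict:
    """Hypothetical end-to-end helper: detect, convert if needed, probe, transcribe."""
    kind = classify_upload(path, mime_type)
    if kind == "other":
        raise ValueError(f"Not an audio or video file: {path}")
    audio_path = path
    if kind == "video":
        audio_path = str(Path(path).with_suffix(".mp3"))
        subprocess.run(ffmpeg_extract_audio_cmd(path, audio_path), check=True)
    probe = subprocess.run(
        ffprobe_metadata_cmd(audio_path), check=True, capture_output=True, text=True
    )
    result = parse_ffprobe(probe.stdout)
    result["transcript"] = transcribe(audio_path, api_key)
    return result
```

Calling `process_upload("talk.mp4", api_key="...")` would then return the duration, format, tags, and transcript in one dictionary. Note that the transcription endpoint has an upload size limit (25 MB as far as I know), so long recordings may need to be split or compressed first.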

Hope it helps.

Yeah, it’s helpful.