I’m looking for a way to set up a workflow in N8N where I can upload an MP4 file, extract its transcription, and then use that transcription to generate new editorial content streams with the help of automated agents.
Does anyone have any suggestions on how to structure this first part of the workflow?
How do you imagine triggering your workflow?
You could:
Upload via an n8n Form trigger
Upload via a POST request to a Webhook node
Upload from a third-party app like Google Drive or Telegram
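Whichever trigger you pick, the POST option is the easiest to test from a script. Here's a minimal Node 18+ sketch; the host and the `upload-video` webhook path are placeholders, so copy the real URL from your Webhook node:

```javascript
// Hypothetical sketch: send an MP4 to an n8n Webhook trigger from Node 18+.
// The URL below is a placeholder -- use your Webhook node's production URL.
const WEBHOOK_URL = "https://your-n8n-host/webhook/upload-video";

// Build the multipart body; n8n exposes the uploaded file as binary data.
function buildForm(bytes, filename) {
  const form = new FormData();
  form.append("file", new Blob([bytes], { type: "video/mp4" }), filename);
  return form;
}

// Actual upload (requires a reachable n8n instance):
async function upload(bytes, filename) {
  const res = await fetch(WEBHOOK_URL, {
    method: "POST",
    body: buildForm(bytes, filename),
  });
  return res.status;
}

// Local demo: build the form without sending it.
const form = buildForm(new Uint8Array([0, 0, 0]), "video.mp4");
console.log(form.get("file").name);
```

The same request works from any app that can send multipart form data, which is why the webhook route is the most flexible of the three triggers.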
Next, you can use OpenAI Whisper (search for “Transcribe Recording”) to get your file transcribed. Once you have the transcript, you can process it with LLM calls / Agents. Did I understand it right that by “first part of the workflow” you mean the transcription?
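Under the hood, the Transcribe Recording operation calls OpenAI's audio transcription endpoint. If you ever need the same thing outside n8n, a hedged sketch (the endpoint and the classic `whisper-1` model id are OpenAI's documented ones, but double-check which models your account can use):

```javascript
// Rough equivalent of the "Transcribe Recording" operation: POST the audio
// as multipart form data to OpenAI's transcription endpoint.
async function transcribe(bytes, apiKey) {
  const form = new FormData();
  form.append("model", "whisper-1"); // classic Whisper model id
  form.append("file", new Blob([bytes], { type: "audio/mpeg" }), "audio.mp3");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Whisper request failed: ${res.status}`);
  return (await res.json()).text; // transcript as plain text
}
```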
If you want another option, I have also used Groq-hosted Whisper, but for that you would have to use a custom HTTP Request node.
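Groq exposes an OpenAI-compatible API surface, so the custom HTTP Request node (or a script) mostly just needs a different base URL and model name. A sketch, assuming the `whisper-large-v3` model id is still current:

```javascript
// Sketch of a Groq-hosted Whisper call (OpenAI-compatible API surface).
// Endpoint and model id are assumptions -- verify against Groq's docs.
async function transcribeWithGroq(bytes, apiKey) {
  const form = new FormData();
  form.append("model", "whisper-large-v3");
  form.append("file", new Blob([bytes], { type: "audio/mpeg" }), "audio.mp3");
  const res = await fetch("https://api.groq.com/openai/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Groq request failed: ${res.status}`);
  return (await res.json()).text;
}
```

In the HTTP Request node this maps to: method POST, that URL, a Bearer auth header, and a multipart body with the `model` field plus the binary file field.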
Ahh sorry, I misread the file type! Understanding videos can be done by both GPT-4o and Gemini 2.0, but it's not an easy implementation.
It reads like you have to break the videos down into frames.
It does work with inline data for Gemini (the video must be smaller than 20 MB).
So I think uploading the file can easily be done with a Form trigger, but if it's a big video you will need an implementation that matches what they suggest in the post.
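For reference, the inline-data request Gemini expects is just the base64-encoded video (under 20 MB) plus a text prompt in a single `generateContent` call. A sketch; the `gemini-2.0-flash` model id is an assumption, so pick whichever 2.0 model you have access to:

```javascript
// Build the generateContent payload: base64 video inline, then the prompt.
function buildGeminiRequest(base64Video, prompt) {
  return {
    contents: [{
      parts: [
        { inline_data: { mime_type: "video/mp4", data: base64Video } },
        { text: prompt },
      ],
    }],
  };
}

// Send it to the REST endpoint (model id is an assumption -- adjust to yours).
async function describeVideo(base64Video, prompt, apiKey) {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    `gemini-2.0-flash:generateContent?key=${apiKey}`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildGeminiRequest(base64Video, prompt)),
  });
  if (!res.ok) throw new Error(`Gemini request failed: ${res.status}`);
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```

In n8n this would again be an HTTP Request node, with a Code node beforehand to base64-encode the binary data.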
Thinking about simplifying the workflow to run an initial POC: do you think there would be a way for me to simply input already-transcribed material and, from that text, have AI-connected agents generate new content streams for me?
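That POC shape needs nothing video-specific: a manual or form trigger that accepts text, then one LLM call per content stream. A sketch of the prompting step outside n8n, assuming OpenAI's chat completions endpoint; the stream names and model id are illustrative only:

```javascript
// Hypothetical POC: turn an existing transcript into several content streams
// with one chat-completion call per stream. Stream names are examples only.
const STREAMS = ["blog post", "LinkedIn post", "newsletter blurb"];

function buildPrompt(stream, transcript) {
  return `Rewrite the following transcript as a ${stream}:\n\n${transcript}`;
}

async function generateStreams(transcript, apiKey) {
  const out = {};
  for (const stream of STREAMS) {
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini", // assumption -- use a model your account has
        messages: [{ role: "user", content: buildPrompt(stream, transcript) }],
      }),
    });
    if (!res.ok) throw new Error(`LLM request failed: ${res.status}`);
    out[stream] = (await res.json()).choices[0].message.content;
  }
  return out;
}
```

In n8n terms, this is a loop (or parallel branches) of LLM/Agent nodes fed from a single text input; swapping the text source for the transcription step later changes nothing downstream.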