I’ve built a video translation and subtitling workflow using n8n, which can be fully deployed locally without relying on any external APIs. Here’s a summary of the steps:
- Extract text from the video using Whisper.
- Translate the text using a large language model (a smaller model running on Ollama works for this).
- Generate voice audio for each translated line and record each line's duration in a JSON file.
- For text-to-speech, I used index tts2, which works well for both Chinese and English — and likely supports Spanish too.
- Based on the audio durations, generate an ASS subtitle file.
- Replace the original audio track in the video with the newly generated voice audio.
- Add the generated subtitles to the video.
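For the translation step, a minimal sketch of how each line could be sent to a local Ollama instance — the model name and prompt wording here are just examples, not what the workflow necessarily uses:

```python
import json

def ollama_translate_payload(text, target_lang="English", model="qwen2.5:7b"):
    """Build a request body for Ollama's /api/generate endpoint.

    The model name is an example; any model pulled into the local
    Ollama instance can be substituted.
    """
    prompt = (f"Translate the following subtitle line into {target_lang}. "
              f"Return only the translation.\n\n{text}")
    return json.dumps({"model": model, "prompt": prompt, "stream": False})
```

The resulting JSON can be POSTed to `http://localhost:11434/api/generate`; with `"stream": False`, the translation comes back in the `response` field of the reply. In n8n this maps naturally onto an HTTP Request node inside a loop over the transcript lines.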
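To illustrate how the duration JSON can drive subtitle timing: a sketch (assuming the JSON is an ordered list of `{"text", "duration"}` entries — the field names are my assumption) that accumulates durations into start/end times and emits an ASS file:

```python
def ass_time(seconds: float) -> str:
    """Format seconds as the H:MM:SS.cc timestamp ASS expects."""
    cs = round(seconds * 100)              # total centiseconds
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

# Minimal header; Format fields and the Style line must stay in sync.
ASS_HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1280
PlayResY: 720

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, Alignment, MarginV
Style: Default,Arial,48,&H00FFFFFF,2,40

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

def build_ass(lines):
    """lines: list of {"text": str, "duration": float} in playback order."""
    events, t = [], 0.0
    for line in lines:
        start, end = t, t + line["duration"]
        events.append(
            f"Dialogue: 0,{ass_time(start)},{ass_time(end)},"
            f"Default,,0,0,0,,{line['text']}"
        )
        t = end                             # next line starts where this one ends
    return ASS_HEADER + "\n".join(events) + "\n"
```

This assumes the TTS lines play back-to-back with no gaps; if the workflow inserts pauses between lines, those offsets would need to be added into the running total as well.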
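The last two steps (swapping the audio and adding subtitles) are commonly done with ffmpeg; a sketch of the two invocations, built as argument lists for `subprocess` — the file names are placeholders:

```python
def replace_audio_cmd(video, audio, out):
    # Keep the video stream as-is, take audio from the generated
    # track, and stop at the shorter of the two inputs.
    return ["ffmpeg", "-i", video, "-i", audio,
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy", "-shortest", out]

def burn_subtitles_cmd(video, ass_file, out):
    # Re-encode the video with the ASS subtitles rendered into the frames.
    return ["ffmpeg", "-i", video, "-vf", f"ass={ass_file}", out]
```

Each list can be run with `subprocess.run(cmd, check=True)`. Burning the subtitles in requires a re-encode; if hard subtitles aren't needed, the ASS file can instead be muxed as a soft subtitle track into an MKV with stream copy, which is much faster.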
My workflow is relatively simple for now, and I’d really appreciate feedback and suggestions from everyone.
