Translate Videos Locally Using Whisper, Index TTS, and n8n

I’ve built a video translation and subtitling workflow using n8n that can run fully locally, without relying on any external APIs. Here’s a summary of the steps:

  1. Extract text from the video using Whisper.

  2. Translate the text using a large language model (you can use a smaller model running on Ollama for this).

  3. Generate speech audio for each translated line and record each line’s duration in a JSON file.

  4. For text-to-speech, I used IndexTTS2, which works well for both Chinese and English, and likely supports Spanish too.

  5. Based on the audio durations, generate an ASS subtitle file.

  6. Replace the original audio in the video with the newly generated voice audio.

  7. Add the generated subtitles to the video.
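Step 3 can be sketched in plain Python: measure each generated WAV file with the standard-library `wave` module and pair it with its line of text. The function names, file names, and the back-to-back timing assumption are mine, not from the original workflow.

```python
import json
import wave

def wav_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def build_line_index(lines, wav_paths, out_json="lines.json"):
    """Pair each translated line with its audio duration and save as JSON."""
    entries = []
    cursor = 0.0  # running start time, assuming the lines play back-to-back
    for text, path in zip(lines, wav_paths):
        dur = wav_duration(path)
        entries.append({"text": text,
                        "start": round(cursor, 2),
                        "duration": round(dur, 2)})
        cursor += dur
    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return entries
```

The resulting JSON then drives both the subtitle timing and the final audio assembly.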
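For step 5, here is a minimal sketch of turning those timed entries into an ASS file. ASS timestamps use the `H:MM:SS.CC` format (centiseconds), and the style header below is a trimmed-down placeholder, not the full field set a production style section would carry.

```python
def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp: H:MM:SS.CC (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

# Minimal header; real scripts usually declare the complete style field list.
ASS_HEADER = """[Script Info]
ScriptType: v4.00+

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, Alignment, MarginV
Style: Default,Arial,48,&H00FFFFFF,2,40

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

def write_ass(entries, out_path="subs.ass"):
    """Write one Dialogue event per line, timed by its audio duration."""
    events = [
        f"Dialogue: 0,{ass_time(e['start'])},"
        f"{ass_time(e['start'] + e['duration'])},"
        f"Default,,0,0,0,,{e['text']}"
        for e in entries
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(ASS_HEADER + "\n".join(events) + "\n")
```

Each subtitle line ends exactly when its generated audio does, which keeps text and speech in sync.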
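Steps 6 and 7 are typically ffmpeg jobs. A sketch of the two invocations, built as argument lists so they can be run with `subprocess` (the file names are placeholders; this assumes ffmpeg is installed with libass support, which the `ass` filter needs):

```python
def replace_audio_cmd(video, new_audio, out):
    """ffmpeg: keep the original video stream, swap in the generated audio."""
    return ["ffmpeg", "-y", "-i", video, "-i", new_audio,
            "-map", "0:v:0", "-map", "1:a:0",   # video from input 0, audio from input 1
            "-c:v", "copy", "-c:a", "aac",      # don't re-encode the video
            "-shortest", out]

def burn_subtitles_cmd(video, ass_file, out):
    """ffmpeg: hard-burn the ASS subtitles into the video (re-encodes video)."""
    return ["ffmpeg", "-y", "-i", video,
            "-vf", f"ass={ass_file}",
            "-c:a", "copy", out]

# To execute, e.g. from an n8n Execute Command node or a script:
# import subprocess
# subprocess.run(replace_audio_cmd("in.mp4", "dub.wav", "dubbed.mp4"), check=True)
# subprocess.run(burn_subtitles_cmd("dubbed.mp4", "subs.ass", "final.mp4"), check=True)
```

Burning the subtitles forces a video re-encode; if soft subtitles are acceptable, muxing the ASS file as a subtitle stream instead would avoid that cost.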

My workflow is relatively simple for now, and I’d really appreciate feedback and suggestions from everyone.


I’m interested in this workflow. Could you provide a more detailed template? Thanks!