Hey everyone,
I recently built a completely zero-touch, autonomous media pipeline inside n8n and wanted to share the architecture with the community.
The goal was to build a faceless AI YouTube channel that actually maintains 100% character visual consistency (which is usually the biggest failure point in AI video generation).
The Stack:
- n8n (Core orchestration)
- Telegram Bot (Trigger & Final Delivery)
- Gemini 3.1 Multimodal (Director & Image Translation)
- OpenRouter (Unified API billing for generation & TTS)
- Python / FFmpeg (Video Compiler)
How the “Brain” works:
Instead of using text prompts to generate characters (which always hallucinates), the n8n Agent receives a base static image of the character. It forces Gemini 3.1 to act strictly as an image-translator, altering only the facial expressions and environment based on scraped news data, leaving the core 3D geometry untouched.
The “Muscle” (Bypassing FFmpeg Jitter):
If you’ve automated video with FFmpeg, you know the zoompan filter creates a terrible, jittery sub-pixel mess. I built a Python node to execute an Oversampling bypass: It scales the raw generated image up to 4K, runs the camera movement on the massive pixel density, and downscales it back to 1080p. The result is buttery smooth.
I recorded a full teardown of the nodes and the final rendered output here:(https://www.youtube.com/watch?v=qRJN9VVqy0g)
Here is the raw AI Director. If you want the complete, connected workspace with all the sub-agents and routing, I’ve put the full JSON file here for $0:
( Project Stickman: Autonomous AI Studio (n8n Workflow) )
Let me know if you have any questions about the data synthesizer agent or the FFmpeg logic. Would love to hear how you guys handle complex media rendering in n8n!