Hi everyone, I’m evaluating n8n for building a real-time AI voice agent using self-hosted STT, LLM, and TTS models, and I’d like to understand if it’s suitable for low-latency production use or mainly for async workflows. Could you please clarify:
- Can n8n handle real-time pipelines (<1–1.5s latency) for voice interactions?
- Does it support parallel execution or only sequential node processing?
- What kind of latency overhead should I expect per node/API call (even on same server)?
- Is streaming (partial STT/TTS responses) supported or only full request-response cycles?
- Any best practices to reduce latency (parallelism, fewer API hops, etc.)?
- Would you recommend using n8n as the core voice orchestration engine, or only in a hybrid setup with a real-time backend (e.g., FastAPI/WebSockets/LiveKit)?
Hi @Amit_Tomar, welcome!
I have built multiple voice agents and real-time systems using n8n and 11Labs, and n8n is well suited for this use case. The part I'm not sure about is the "self-hosted models" requirement: real-time voice agents need very capable models, and the open-source models people can realistically self-host are not as strong as GPT-class models. These systems require the AI agent to do a lot of tool calling and to manage everything within one session, so I recommend using a powerful hosted model rather than self-hosting. With that in place, you can absolutely build systems like this using n8n.
the core issue is that n8n processes nodes sequentially in a request/response model and doesn’t support streaming natively – so STT partial outputs or TTS chunked audio aren’t really an option out of the box. for parallel execution you can fan out via sub-workflows but there’s coordination overhead. hitting <1s end-to-end for a full voice turn is going to be really tough with n8n as the sole orchestrator – the execution queue alone adds latency. the hybrid setup makes more sense: livekit or a fastapi websocket server handles the real-time audio pipeline, and n8n handles the async side (booking confirmations, crm updates, logging).
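the split described above can be sketched in plain asyncio: the STT/LLM/TTS calls below are stubs with made-up timings (all names and numbers are illustrative, not a real n8n or LiveKit API), and the n8n webhook is scheduled without being awaited, so its round-trip never sits on the voice turn's critical path.

```python
import asyncio
import time

# Stubs for the real-time loop; in production these would hit your
# STT/LLM/TTS services directly over websockets, never going through n8n.
async def stt(audio: bytes) -> str:
    await asyncio.sleep(0.2)  # simulated transcription time
    return "book a table for two"

async def llm(text: str) -> str:
    await asyncio.sleep(0.4)  # simulated model inference
    return "Sure, booking a table for two."

async def tts(text: str) -> bytes:
    await asyncio.sleep(0.3)  # simulated speech synthesis
    return b"<audio bytes>"

async def notify_n8n(event: dict) -> None:
    # fire-and-forget webhook to n8n for the async side (CRM, logging);
    # a 1s round-trip here costs the caller nothing
    await asyncio.sleep(1.0)

async def voice_turn(audio: bytes) -> float:
    start = time.monotonic()
    text = await stt(audio)
    reply = await llm(text)
    await tts(reply)
    # schedule the business-logic webhook WITHOUT awaiting it
    asyncio.create_task(notify_n8n({"transcript": text, "reply": reply}))
    return time.monotonic() - start

elapsed = asyncio.run(voice_turn(b"..."))
print(f"turn latency: {elapsed:.2f}s")  # ~0.9s: the webhook is off the critical path
```

if the webhook were awaited inline, the same turn would take ~1.9s — which is the whole argument for keeping n8n out of the audio loop.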
@Benjamin_Behrens nailed it. one practical note on the hybrid setup: n8n connects to Vapi/LiveKit via webhook — Vapi fires a webhook on call start/end/tool call, n8n handles it and fires back a response. the latency-critical STT-LLM-TTS loop never touches n8n, only the business logic does (check calendar, update CRM, send confirmation).
i’ve built this with Vapi as the real-time layer and n8n handling the tool calls via webhook — works well in production. if you’re going fully self-hosted, LiveKit Agents is the better fit for the real-time core.
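to make the webhook pattern concrete, here is a minimal sketch of what the n8n side of a tool-call webhook does. the payload shape is my assumption in the style of what voice platforms send, not Vapi's exact schema, and `check_calendar` stands in for whatever n8n nodes do the real work.

```python
import json

def handle_tool_call(body: str) -> str:
    """Illustrative tool-call webhook handler (payload shape is assumed).

    The voice platform fires this mid-call; the JSON result is returned to
    it and spoken to the caller. In n8n this would be a Webhook node
    feeding a branch of business-logic nodes.
    """
    event = json.loads(body)
    tool = event["tool"]                  # e.g. "check_calendar"
    args = event.get("arguments", {})
    if tool == "check_calendar":
        # stand-in for a real calendar lookup node
        result = {"available": True, "slot": args.get("slot")}
    else:
        result = {"error": f"unknown tool: {tool}"}
    return json.dumps({"result": result})

payload = json.dumps({"tool": "check_calendar", "arguments": {"slot": "2pm"}})
print(handle_tool_call(payload))
```

the key property is that this handler's latency only delays the tool result the agent is already waiting on — it never touches the per-utterance STT/LLM/TTS loop.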
@Pavel_Kuzko thanks for the concrete production example – the Vapi webhook pattern is exactly the right abstraction. n8n only enters the picture after the real-time loop completes a tool call, so the latency budget stays intact. for fully self-hosted setups, LiveKit Agents + n8n webhooks achieves the same split without Vapi in the middle.