n8n is growing in popularity, largely because it makes it convenient to build text-based AI chatbots, such as ones integrated with Telegram. The next frontier is voice-based AI chatbots. The older cascaded approach (transcribe speech to text, process the text with a large language model (LLM), synthesize the response back into speech, then deliver it) seems inefficient, since every extra stage adds latency. Modern neural network models can now process voice input directly, which removes the need for the intermediate transcription and synthesis steps. Real-time voice integration is also feasible through technologies such as WebRTC.
Is there any prospect, in the near future, of an AI agent that accepts voice input directly, without first converting it to text? Such a system, especially when combined with dynamic tools, would be a significant advance in AI automation.