Local voice AI agent with hybrid memory using n8n — looking for feedback

Hi everyone,

I’m experimenting with a local-first voice AI agent using n8n.

The current setup uses:

  • Python script for audio input/output

  • local STT and TTS services

  • n8n for the workflow

  • PostgreSQL for structured memory

  • pgvector for semantic long-term memory

  • LM Studio for local LLM inference

The loop is already working end-to-end:

voice input → STT → n8n → PostgreSQL/pgvector memory → local LLM → TTS → voice output

It still feels more like a voice-enabled chat than a natural real-time conversation, so I’m now working on latency, memory retrieval quality, context filtering, and observability.

I’ve documented the prototype here:

It is still a work in progress, but I’d love feedback from the community, especially around workflow structure, hybrid memory patterns, and pgvector retrieval. There is so much to improve and learn! :smiley:

Thanks!

Nice project. I would look at this as two separate problems: the realtime turn loop and the memory/observability layer.

For the turn loop, I would first measure every stage separately:

- audio capture → STT latency

- n8n workflow execution time

- memory retrieval time

- local LLM time to first token and full response

- TTS start time and full audio generation

That usually shows whether the system feels like a chat because of orchestration delay, model delay, or TTS delay. n8n is great as the orchestrator, but I would avoid putting anything slow or optional in the critical path of a voice turn.
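
A minimal sketch of that per-stage measurement in Python, in case it helps. The `fake_*` functions are hypothetical stand-ins for the real STT, memory, LLM and TTS calls, and the sleep durations are made up. The n8n execution time could be measured the same way by wrapping the call that triggers the workflow.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Store the wall-clock duration of one stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage + "_ms"] = (time.perf_counter() - start) * 1000.0


# Hypothetical stand-ins for the real services (assumptions, not real APIs).
def fake_stt(audio):
    time.sleep(0.01)
    return "what is on my calendar"


def fake_memory_lookup(text):
    time.sleep(0.005)
    return ["user prefers short answers"]


def fake_llm_stream(prompt):
    for token in ["You", " have", " one", " meeting."]:
        time.sleep(0.005)
        yield token


def fake_tts(text):
    time.sleep(0.01)
    return b"\x00" * 16


def measure_turn(audio):
    timings = {}
    with timed("stt", timings):
        text = fake_stt(audio)
    with timed("memory", timings):
        memories = fake_memory_lookup(text)

    # Time to first token is tracked separately from the full response.
    llm_start = time.perf_counter()
    tokens = []
    for token in fake_llm_stream(text + " " + " ".join(memories)):
        if not tokens:
            timings["llm_first_token_ms"] = (time.perf_counter() - llm_start) * 1000.0
        tokens.append(token)
    timings["llm_total_ms"] = (time.perf_counter() - llm_start) * 1000.0

    with timed("tts", timings):
        fake_tts("".join(tokens))
    return timings


if __name__ == "__main__":
    for stage, ms in measure_turn(b"raw-audio").items():
        print(f"{stage:>20}: {ms:6.1f} ms")
```

Keeping time to first token separate from the full response time matters for voice, because TTS can often start before the LLM has finished.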

For hybrid memory, I would keep three memory layers instead of one big retrieval step:

1. short rolling conversation buffer

2. structured state in Postgres, for stable facts/preferences/tasks

3. semantic recall from pgvector, with a small top_k and source/type tags

Then add a filter step before the LLM: which memories are actually relevant to this turn, and why? Logging that decision is very useful when the agent starts recalling odd context.
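
Roughly what I mean by the filter step, as a Python sketch. The `Recall` type, the threshold of 0.75, the `top_k` of 3 and the kind names are all assumptions for illustration. I also assume `score` is a similarity where higher means closer; if your pgvector query returns a distance, convert it first.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Recall:
    text: str
    kind: str      # source/type tag, e.g. "fact", "preference", "episode"
    score: float   # assumed similarity, higher = closer


def filter_recalls(candidates, allowed_kinds, min_score=0.75, top_k=3):
    """Return (kept, decisions); decisions records why each item was kept or dropped."""
    kept, decisions = [], []
    for item in sorted(candidates, key=lambda r: r.score, reverse=True):
        if item.kind not in allowed_kinds:
            reason = "kind not allowed"
        elif item.score < min_score:
            reason = "below min_score"
        elif len(kept) >= top_k:
            reason = "over top_k"
        else:
            reason = "kept"
            kept.append(item)
        decisions.append((item.text, reason))
    return kept, decisions


def build_context(buffer, state, recalls):
    """Assemble the three layers into one prompt section."""
    lines = ["Known facts:"] + [f"- {k}: {v}" for k, v in state.items()]
    lines += ["Relevant memories:"] + [f"- {r.text}" for r in recalls]
    lines += ["Recent turns:"] + list(buffer)
    return "\n".join(lines)


if __name__ == "__main__":
    buffer = deque(["user: play something", "assistant: any genre?"], maxlen=6)
    state = {"name": "Alex"}
    candidates = [
        Recall("likes jazz", "preference", 0.90),
        Recall("random chat about weather", "episode", 0.95),
        Recall("owns a cat", "fact", 0.40),
    ]
    kept, decisions = filter_recalls(candidates, {"fact", "preference"})
    for text, reason in decisions:
        print(f"{reason:>18}: {text}")   # this is the part worth logging
    print(build_context(buffer, state, kept))
```

The `decisions` list is what I would log per turn, so that odd recalls can be traced back to a score or a tag.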

For observability, I would add a correlation_id per voice turn and write one row per stage with status, duration, and error. That makes it much easier to debug partial failures like “STT worked, memory retrieval worked, but TTS failed” without reading full private transcripts.
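
A small sketch of the stage log, assuming a table called `turn_stage_log` (a hypothetical name). It uses an in-memory SQLite database from the Python standard library only so that it runs without setup; in your stack the same table would live in Postgres. Only the exception type is stored, not the message, so transcript text does not leak into the log.

```python
import sqlite3
import time
import uuid


def open_log():
    # SQLite stands in for PostgreSQL here so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE turn_stage_log (
               correlation_id TEXT NOT NULL,
               stage          TEXT NOT NULL,
               status         TEXT NOT NULL,
               duration_ms    REAL NOT NULL,
               error          TEXT
           )"""
    )
    return conn


def run_stage(conn, correlation_id, stage, fn):
    """Run one stage and write exactly one log row for it, even on failure."""
    start = time.perf_counter()
    status, error, result = "ok", None, None
    try:
        result = fn()
    except Exception as exc:
        status, error = "error", type(exc).__name__
    duration_ms = (time.perf_counter() - start) * 1000.0
    conn.execute(
        "INSERT INTO turn_stage_log VALUES (?, ?, ?, ?, ?)",
        (correlation_id, stage, status, duration_ms, error),
    )
    conn.commit()
    return result, status


def failing_tts():
    raise TimeoutError("tts backend did not answer")


if __name__ == "__main__":
    conn = open_log()
    cid = str(uuid.uuid4())          # one id per voice turn
    run_stage(conn, cid, "stt", lambda: "hello")
    run_stage(conn, cid, "memory", lambda: [])
    run_stage(conn, cid, "tts", failing_tts)
    query = ("SELECT stage, status, error FROM turn_stage_log "
             "WHERE correlation_id = ? ORDER BY rowid")
    for row in conn.execute(query, (cid,)):
        print(row)
```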

One design detail I would consider: write long-term memory after the response is generated, not before. That reduces the chance of noisy intermediate thoughts or failed turns getting stored as durable memory.
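
As a sketch of that ordering, assuming hypothetical `generate` and `store_memory` callables that you would replace with the LLM call and the Postgres/pgvector write:

```python
def handle_turn(user_text, generate, store_memory):
    """Generate a reply first; write durable memory only if that succeeded."""
    try:
        reply = generate(user_text)
    except Exception:
        return None          # failed turn: nothing is written to long-term memory
    if not reply.strip():
        return None          # empty reply: also not worth remembering
    store_memory({"user": user_text, "assistant": reply})
    return reply


if __name__ == "__main__":
    stored = []

    def broken_llm(text):
        raise RuntimeError("model not loaded")

    handle_turn("hi", broken_llm, stored.append)
    handle_turn("hi", lambda text: "hello!", stored.append)
    print(stored)
```

Only the completed exchange reaches `store_memory`; the failed turn leaves no trace in durable memory.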

The combo of pgvector for semantic recall + PostgreSQL structured memory is a solid foundation for this kind of agent. One thing worth thinking about on the latency side: for short-context turns (e.g. follow-up questions where the user is clearly continuing from the last exchange), you could skip the pgvector retrieval entirely and just use the rolling buffer. The semantic lookup adds meaningful latency and often isn’t needed when the context is already fresh in the window.
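
One possible way to decide when to skip the lookup, as a rough Python heuristic. The opener list, the word limit and the 30-second freshness window are assumptions that would need tuning against real conversations.

```python
# Hypothetical phrases that often open a follow-up question.
FOLLOW_UP_OPENERS = ("and ", "what about", "how about", "why", "ok ", "so ")


def needs_semantic_recall(user_text, seconds_since_last_turn,
                          max_words=8, fresh_window_s=30.0):
    """Return False when the rolling buffer alone is likely enough."""
    text = user_text.strip().lower()
    is_short = len(text.split()) <= max_words
    is_fresh = seconds_since_last_turn <= fresh_window_s
    looks_like_follow_up = text.startswith(FOLLOW_UP_OPENERS)
    return not (is_short and is_fresh and looks_like_follow_up)


if __name__ == "__main__":
    print(needs_semantic_recall("and what about tomorrow?", 5.0))
    print(needs_semantic_recall(
        "Remind me what my dentist said about the appointment last month", 5.0))
    print(needs_semantic_recall("and what about tomorrow?", 600.0))
```

When the heuristic is unsure it falls back to doing the retrieval, so the cost of a wrong guess is extra latency, not a missing memory.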

Thank you for the comments!
I have already tested running the memory store in parallel with the TTS response, but moving it after the response might be more efficient, as you say.
At the moment the system handles one round (user input, then response) at a time, since I wanted to get it working first. Now I will focus on the conversation loop and on reducing processing time. The current delay does not feel too long, but it is not a real-time conversation yet.

Thank you for your help!

Thank you for the tip! I will take it into consideration!