Voice mode of my document-grounded chatbot hallucinates — text mode works, voice mode invents answers. Help?

Daniel142 · January 17, 2026, 9:47am

Hello everyone — I’m building a training bot for my sales team that must answer only from a single RTF file containing well-structured customer objections and the exact, approved rebuttals. In text mode (I upload the RTF and query via a custom GPT) the bot reliably returns the exact passages or correctly cites the document. Problem solved there.

However, when I switch to voice mode the assistant repeatedly hallucinates, invents words and phrases, paraphrases or fabricates answers that are not in the RTF. This makes the voice trainer useless.

What I need help with

Tools, architectures or concrete recipes that let me build a voice (speech) agent that is strictly grounded in one or more documents or databases and will only return exact text excerpts from those sources.
Examples or tutorials (ideally hands-on, with code or config) showing a tool where the output is verbatim from the source.
I’m specifically looking for a relatively simple, preferably low-code or no-code solution, as I do not want to build and maintain a fully custom voice + RAG pipeline from scratch.
I live and work in Germany, so GDPR compliance would be a bonus.

What I’ve tried / observed

RTF ingestion + chunking + text queries → works very reliably when used in text-only mode (I can reproduce exact lines from the RTF).
Voice flow often answers without citing any retrieved chunk, paraphrases, or invents whole sentences that do not exist in the file. ASR transcription quality seems fine — the problem appears to be the lack of strict grounding or the pipeline allowing free generation.
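
To make "strictly grounded" concrete, here is a minimal sketch of the behaviour I am after, under stated assumptions. The language model never writes the answer: the ASR transcript is only used to pick the closest known objection, and the approved rebuttal is returned verbatim (and could then be sent to TTS unchanged). The names `APPROVED`, `FALLBACK`, `answer`, the sample sentences and the 0.6 threshold are all made up for illustration and are not from my real setup. `difflib` is just a standard-library stand-in for whatever retrieval a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the parsed RTF: objection -> approved rebuttal.
APPROVED = {
    "It is too expensive.":
        "I understand. Let me show you the monthly cost next to what you pay today.",
    "We already have a supplier.":
        "That makes sense. Many customers kept their supplier and added us for peak demand.",
}

# Spoken when nothing in the document matches well enough.
FALLBACK = "I can't find that in the document."


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def answer(transcript: str, threshold: float = 0.6) -> str:
    """Return an approved rebuttal verbatim, or the fallback. Never generate text."""
    best = max(APPROVED, key=lambda objection: similarity(transcript, objection))
    if similarity(transcript, best) < threshold:
        return FALLBACK
    return APPROVED[best]


if __name__ == "__main__":
    # The transcript would come from the ASR step in a real voice flow.
    print(answer("it is too expensive"))
```

The point of the design is that the only strings that can ever reach the speaker are values of `APPROVED` or `FALLBACK`, so paraphrasing is impossible by construction. A wrong match is still possible, which is why the threshold and the fallback matter.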

What do you think? I would also like to use n8n in this project, if possible.

Thanks in advance
Daniel

achamm · January 17, 2026, 9:54am

I'm not sure what's going on, but it seems like my voice model is getting confused between text mode and speech mode. I've tried using the n8n-node tool to get a hands-on look at how it behaves with different inputs, but the output is all over the place. Have you seen anything similar?