Voice mode of my document-grounded chatbot hallucinates — text mode works, voice mode invents answers. Help?

Hello everyone — I’m building a training bot for my sales team that must answer only from a single RTF file containing well-structured customer objections and the exact, approved rebuttals. In text mode (I upload the RTF and query via a custom GPT) the bot reliably returns the exact passages or correctly cites the document. Problem solved there.

However, when I switch to voice mode the assistant repeatedly hallucinates, invents words and phrases, paraphrases or fabricates answers that are not in the RTF. This makes the voice trainer useless.

What I need help with

  • Tools, architectures or concrete recipes that let me build a voice (speech) agent that is strictly grounded in one or more documents or databases and returns only exact text excerpts from those sources.

  • Examples or tutorials (ideally hands-on, with code or config) showing a tool where the output is verbatim from the source.

  • I’m specifically looking for a relatively simple, preferably low-code or no-code solution, as I do not want to build and maintain a fully custom voice + RAG pipeline from scratch.

  • I live and work in Germany, so GDPR-compliant would be a bonus.

What I’ve tried / observed

  • RTF ingestion + chunking + text queries → works very reliably when used in text-only mode (I can reproduce exact lines from the RTF).

  • Voice flow often answers without citing any retrieved chunk, paraphrases, or invents whole sentences that do not exist in the file. ASR transcription quality seems fine — the problem appears to be the lack of strict grounding or the pipeline allowing free generation.
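
To make the behaviour I am after concrete, here is a minimal sketch of a lookup-only answer step, assuming the objections and rebuttals have already been parsed out of the RTF into pairs. The sample data, the `answer_verbatim` name and the 0.6 threshold are all made up for illustration, and the fuzzy matching uses Python's standard `difflib` rather than any particular voice or RAG product. The point is that no language model generates text here: the ASR transcript is only used to pick an entry, and the stored rebuttal is handed on unchanged (for example to a TTS step).

```python
from difflib import SequenceMatcher

# Hypothetical sample pairs; in practice these would be parsed from the RTF.
REBUTTALS = {
    "It is too expensive.": "I understand. Let us compare the monthly cost with what you pay today.",
    "We already have a supplier.": "That makes sense. Many customers kept their supplier and started with a small pilot.",
}

# Fixed sentence spoken when nothing in the document matches well enough.
FALLBACK = "I could not find that objection in the document."


def normalize(text):
    """Lowercase and collapse whitespace so ASR output compares more fairly."""
    return " ".join(text.lower().split())


def answer_verbatim(transcript, threshold=0.6):
    """Return the approved rebuttal unchanged, or the fixed fallback.

    The threshold is an assumed value and would need tuning on real transcripts.
    """
    best_score, best_rebuttal = 0.0, None
    for objection, rebuttal in REBUTTALS.items():
        score = SequenceMatcher(None, normalize(transcript), normalize(objection)).ratio()
        if score > best_score:
            best_score, best_rebuttal = score, rebuttal
    if best_rebuttal is None or best_score < threshold:
        return FALLBACK
    return best_rebuttal


if __name__ == "__main__":
    print(answer_verbatim("it is too expensive"))
```

Anything below the threshold gets the fixed fallback sentence instead of a generated answer, which is the strictness I am missing in voice mode.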

What do you think? I would also like to use n8n in this project, if possible.

Thanks in advance
Daniel

achamm:

I'm not sure what's going on, but it's like my voice model is getting confused between text mode and speech mode. I've tried using the n8n-node tool to get a hands-on look at how it works with different inputs, but the output seems to be all over the place. Have you seen anything similar?