Hello everyone — I'm building a training bot for my sales team that must answer only from a single RTF file containing well-structured customer objections and the exact, approved rebuttals. In text mode (I upload the RTF and query it via a custom GPT), the bot reliably returns the exact passages or correctly cites the document. Problem solved there.
However, when I switch to voice mode, the assistant repeatedly hallucinates: it invents words and phrases, paraphrases, or fabricates answers that are not in the RTF. This makes the voice trainer useless.
What I need help with
- Tools, architectures, or concrete recipes for building a voice (speech) agent that is strictly grounded in one or more documents/databases and will only return exact text excerpts from those documents.
- Examples or tutorials (ideally hands-on, with code or config) showing a tool whose output is verbatim from the source.
- A relatively simple, preferably low-code or no-code solution — I do not want to build and maintain a fully custom voice + RAG pipeline from scratch.
- I live and work in Germany, so GDPR compliance would be a bonus.
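To make the "exact excerpts only" requirement concrete: one pattern that enforces it by construction is to use the model purely as a *selector*, never as a writer — the LLM may only return the index of a matching chunk, and the application always speaks the stored chunk text verbatim. This is a minimal sketch under my own assumptions; `select_chunk_id` is a placeholder for whatever LLM call (or n8n node) does the selection, not a real API:

```python
def split_into_chunks(text, max_chars=400):
    """Split the exported document into chunks.
    Assumption: objection/rebuttal pairs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def answer_verbatim(question, chunks, select_chunk_id):
    """select_chunk_id: callable (question, chunks) -> int index or None.
    In a real pipeline this would be an LLM prompted to output ONLY an
    index (or 'none'); the spoken answer is always the exact stored text,
    so the model physically cannot paraphrase or invent wording."""
    idx = select_chunk_id(question, chunks)
    if idx is None or not (0 <= idx < len(chunks)):
        # Fixed fallback phrase instead of free generation.
        return "Dazu habe ich keine freigegebene Antwort."
    return chunks[idx]  # verbatim excerpt -> hand this string to TTS
```

The key design choice is that text-to-speech receives only strings that already exist in the document (or the fixed fallback), so grounding no longer depends on prompt discipline.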
What I’ve tried / observed
- RTF ingestion + chunking + text queries → works very reliably in text-only mode (I can reproduce exact lines from the RTF).
- The voice flow often answers without citing any retrieved chunk, paraphrases, or invents whole sentences that do not exist in the file. ASR transcription quality seems fine — the problem appears to be a lack of strict grounding, with the pipeline allowing free generation.
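Given that the voice pipeline allows free generation, a cheap safety net is a verbatim guard between the model and TTS: reject any candidate answer that is not an exact (whitespace-normalized) excerpt of the source document. A minimal sketch, with the fallback phrase as my own assumption:

```python
import re

def normalize(s):
    """Collapse whitespace so line breaks from RTF export don't break matching."""
    return re.sub(r"\s+", " ", s).strip()

def guard_verbatim(candidate, source_text,
                   fallback="Keine freigegebene Antwort gefunden."):
    """Pass the candidate through only if it is an exact excerpt of the
    source; otherwise return a fixed fallback instead of speaking it."""
    cand = normalize(candidate)
    if cand and cand in normalize(source_text):
        return candidate
    return fallback
```

This doesn't fix retrieval quality, but it guarantees that a paraphrased or invented sentence is never spoken aloud.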
What do you think? I would also like to use n8n in this project, if possible.
Thanks in advance
Daniel