RAG chatbot speed

Hi everyone,
I’m building a RAG-based chatbot assistant for a website, self-hosted on a VPS running n8n. My current stack: AI Agent + OpenAI (gpt-4o-mini) + Pinecone vector database.
I’m struggling with high latency: it often takes 16-18 seconds to get a response after a user sends a message.
I’m looking to optimize the user experience and would love your advice on three points:

  1. Reducing Response Time: What are the best practices for speeding up n8n Agent and Pinecone queries? I’m already using gpt-4o-mini. Would shortening the system prompt, reducing the topK value in Pinecone, or switching from an AI Agent to a specific Chain (like Retrieval QA) make a significant difference?

  2. Custom Preloader/Status Messages: The default n8n chat widget only shows the three typing dots (typing indicator) during the wait. Is there a way to display custom status messages instead (e.g., “Searching database…”, “Formulating response…”) to keep the user engaged?

  3. Streaming: Does anyone have experience enabling “Stream Responses” in this setup? Can it effectively mask the latency by showing characters as they are generated?

Server specs: Hostinger VPS, 4 vCPU, 16 GB RAM
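Before tuning anything, it may help to measure where the 16-18 seconds actually go. A minimal sketch of per-stage timing (the stage functions here are stubs standing in for the real embedding, Pinecone, and OpenAI calls):

```python
import time

def timed(label, fn, timings):
    """Run fn, record its wall-clock duration under label, and return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result

def profile_rag_pipeline(embed, retrieve, generate, question):
    """Time each stage of a single-step RAG flow: embed -> retrieve -> generate."""
    timings = {}
    vector = timed("embed", lambda: embed(question), timings)
    context = timed("retrieve", lambda: retrieve(vector), timings)
    answer = timed("generate", lambda: generate(question, context), timings)
    timings["total"] = sum(timings.values())
    return answer, timings

# Stub stages; swap in the real API calls to see which stage dominates.
answer, timings = profile_rag_pipeline(
    embed=lambda q: [0.1, 0.2],
    retrieve=lambda v: ["doc1", "doc2"],
    generate=lambda q, c: "stub answer",
    question="What are your opening hours?",
)
print(timings)
```

Once you know whether retrieval or generation dominates, you know whether lowering topK or shrinking the prompt is the bigger lever.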

I’d appreciate any insights or tips on how to bring this delay down to a few seconds. Thanks in advance!


Hi @tamy.santos
Thanks, this makes sense. I’ll start by removing the AI Agent and switching to a single-step RAG flow, then reduce LLM calls and context size. Appreciate the clear breakdown :+1:

I hope everything goes well.
If this solution solved the issue for you, please consider leaving a like or marking the reply as the solution (it helps others find the answer more easily and also supports community contributors).

Hi @Svetislav_Kondic
To reduce the latency of a RAG chatbot:

  • Reduce the Pinecone workload: lower topK and use a smaller embedding model.
  • Simplify your prompts, and don’t use a huge system prompt.
  • Use a more direct Retrieval QA chain instead of a full AI Agent.
  • Enable streaming responses so tokens arrive as they are generated.
  • Serve your UI via a Webhook/Respond to Webhook endpoint with custom status messages instead of the default widget to boost perceived responsiveness; for styling the webhook’s HTML, a static site generator can help.

Please do not rely on AI-generated responses like that, and flag them as soon as possible.
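On the streaming point: streaming mainly improves *perceived* latency, because the user sees the first tokens long before the full answer is done. A toy sketch of that effect (the token generator is a stand-in for the real OpenAI streaming API, and the delays are made up):

```python
import time

def fake_token_stream(text, first_token_delay=0.05, per_token_delay=0.01):
    """Simulate a streaming LLM: one delay before the first token, then steady tokens."""
    time.sleep(first_token_delay)
    for token in text.split():
        yield token + " "
        time.sleep(per_token_delay)

def consume_stream(stream):
    """Collect tokens, recording time-to-first-token separately from total time."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # the user sees output from here on
        chunks.append(token)
    total = time.perf_counter() - start
    return "".join(chunks).strip(), ttft, total

answer, ttft, total = consume_stream(fake_token_stream("Our store opens at nine"))
print(f"first token after {ttft:.2f}s, full answer after {total:.2f}s")
```

The wait the user feels is roughly the time-to-first-token, not the total generation time, which is why streaming can make a 10-second answer feel like a 1-second one.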

Hi @Anshul_Namdev
Thanks for the advice! This is very helpful — I’ll simplify the flow, reduce Pinecone load, and switch to a more direct Retrieval QA approach. Appreciate the tips on streaming and UI responsiveness as well :+1:
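For intuition on why the direct Retrieval QA approach is faster: an Agent typically makes several LLM round trips (plan, observe the tool output, answer), while a single-step chain retrieves first and then makes one LLM call. A toy illustration (the call structure is illustrative, not n8n's actual internals):

```python
def agent_style(question, llm):
    """Agent pattern: separate LLM calls to plan, to read tool output, and to answer."""
    plan = llm(f"Plan how to answer: {question}")
    tool_result = "retrieved docs"               # pretend tool call (vector search)
    observation = llm(f"Observe tool output: {tool_result}")
    return llm(f"Answer using: {observation}")

def chain_style(question, llm):
    """Retrieval QA pattern: retrieve first, then a single LLM call."""
    context = "retrieved docs"                   # pretend vector search
    return llm(f"Answer {question} using: {context}")

calls = {"n": 0}
def counting_llm(prompt):
    """Stub LLM that just counts how many round trips were made."""
    calls["n"] += 1
    return "reply"

agent_style("opening hours?", counting_llm)
agent_calls = calls["n"]   # three round trips

calls["n"] = 0
chain_style("opening hours?", counting_llm)
chain_calls = calls["n"]   # one round trip
print(agent_calls, chain_calls)
```

Each round trip adds a full model latency, so cutting three calls down to one is often the single biggest win.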

Super! Let me know how that goes.