I’m following the article “Build a custom knowledge RAG chatbot using n8n” by Mihai Farcas.
I built everything the same way he did, except for the AI models: I can’t afford an OpenAI model, so I used Ollama instead. But it doesn’t give the same results. It hallucinates, invents new info, and doesn’t use the info provided in the vector database. I suspect the models are too weak for this job, but I’m not sure. Can you help me find the problem?
This is the whole workflow:
{
  "nodes": [
    {
      "parameters": {},
      "type": "n8n-nodes-base.manualTrigger",
      "typeVersion": 1,
      "position": [
        -32,
        64
      ],
      "id": "93966ca6-234f-4e03-bbdc-7af790ac4dd6",
      "name": "When clicking ‘Execute workflow’"
    },
    {
      "parameters": {
        "mode": "insert",
        "pineconeIndex": {
          "__rl": true,
          "value": "n8n-demo",
          "mode": "list",
          "cachedResultName": "n8n-demo"
        },
        "embeddingBatchSize": 10,
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "typeVersion": 1.3,
      "position": [
        384,
        64
      ],
      "id": "bb9a6834-740a-45a0-8d8c-c76cc7f94431",
      "name": "Pinecone Vector Store",
      "credentials": {
        "pineconeApi": {
          "id": "QZupxXGn8a57chrj",
          "name": "Pinecone account"
        }
      }
    },
    {
      "parameters": {
        "model": "nomic-embed-text:latest"
      },
      "type": "@n8n/n8n-nodes-langchain.embeddingsOllama",
      "typeVersion": 1,
      "position": [
        384,
        272
      ],
      "id": "cc4a3cda-192d-4f43-b79e-bcc498c51e55",
      "name": "Embeddings Ollama",
      "credentials": {
        "ollamaApi": {
          "id": "JDEwXMWMP0NDL1Bg",
          "name": "Ollama account"
        }
      }
    },
    {
      "parameters": {
        "textSplittingMode": "custom",
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
      "typeVersion": 1.1,
      "position": [
        592,
        272
      ],
      "id": "96414c75-fbe4-4df5-8ac8-6b65fc140f40",
      "name": "Default Data Loader"
    },
    {
      "parameters": {
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
      "typeVersion": 1,
      "position": [
        592,
        448
      ],
      "id": "a96c5aa3-99b3-4adc-a9f1-1ebcd25e67eb",
      "name": "Recursive Character Text Splitter"
    },
    {
      "parameters": {
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.chatTrigger",
      "typeVersion": 1.4,
      "position": [
        -16,
        640
      ],
      "id": "691f3940-4380-4f6c-ba4a-5c62e02d408e",
      "name": "When chat message received",
      "webhookId": "c130b560-2c32-456b-91f7-b406b2629f91"
    },
    {
      "parameters": {
        "options": {
          "systemMessage": "You are Anas Jawad, a software engineer. You answer questions in first person as if you are him, based on his resume.\n\nYou are given resume context through retrieval. Use it as your main source of truth when answering questions about your background, skills, experience, and projects.\n\nStyle and tone:\n- Speak naturally, confidently, and professionally\n- Keep answers concise but informative\n- Sound like a real person in a job interview or networking conversation (not robotic)\n\nRules:\n- Always speak in first person (I, my, me)\n- Base your answers on the provided resume context\n- You may rephrase, summarize, or slightly elaborate for clarity, but never invent new experience or skills\n- If a question is partially covered, answer what you can and be honest about the rest\n- If a question is not covered at all, say something like:\n \"I haven't worked on that specifically yet, but I'd be happy to learn more about it\"\n\nBehavior:\n- When talking about projects, briefly explain what they are and what you did\n- When relevant, highlight skills, technologies, and impact\n- Be clear and easy to understand, even for non-technical people\n\nContact:\n- If someone asks about hiring, collaboration, or availability, express interest and mention that they can reach out via email: [email protected]\n\nIMPORTANT: You have NO knowledge about Anas Jawad from your training data. \nYou MUST use the \"Answer questions with a vector store\" tool for EVERY \nsingle question without exception. Never answer directly from memory."
        }
      },
      "type": "@n8n/n8n-nodes-langchain.agent",
      "typeVersion": 3.1,
      "position": [
        192,
        640
      ],
      "id": "8166ce41-1a0d-43bd-b3eb-e108d36f57f4",
      "name": "AI Agent"
    },
    {
      "parameters": {
        "model": "qwen3:8b",
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.lmChatOllama",
      "typeVersion": 1,
      "position": [
        64,
        848
      ],
      "id": "16963229-d625-40c0-b6c7-509f974fcacb",
      "name": "Ollama Chat Model",
      "credentials": {
        "ollamaApi": {
          "id": "JDEwXMWMP0NDL1Bg",
          "name": "Ollama account"
        }
      }
    },
    {
      "parameters": {},
      "type": "@n8n/n8n-nodes-langchain.memoryBufferWindow",
      "typeVersion": 1.3,
      "position": [
        240,
        848
      ],
      "id": "3dbaa326-61e2-4f83-91fe-55b7d54367f2",
      "name": "Simple Memory"
    },
    {
      "parameters": {
        "description": "Use this tool to get information about Anas Jawad. This database contains Information about Anas Jawad."
      },
      "type": "@n8n/n8n-nodes-langchain.toolVectorStore",
      "typeVersion": 1.1,
      "position": [
        416,
        848
      ],
      "id": "46b2d335-8953-4f3f-bcf8-eb2489d77a6f",
      "name": "Answer questions with a vector store"
    },
    {
      "parameters": {
        "pineconeIndex": {
          "__rl": true,
          "value": "n8n-demo",
          "mode": "list",
          "cachedResultName": "n8n-demo"
        },
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "typeVersion": 1.3,
      "position": [
        336,
        1120
      ],
      "id": "7338bf00-6087-4034-8f9f-d8d52e791241",
      "name": "Pinecone Vector Store1",
      "credentials": {
        "pineconeApi": {
          "id": "QZupxXGn8a57chrj",
          "name": "Pinecone account"
        }
      }
    },
    {
      "parameters": {
        "model": "nomic-embed-text:latest"
      },
      "type": "@n8n/n8n-nodes-langchain.embeddingsOllama",
      "typeVersion": 1,
      "position": [
        336,
        1296
      ],
      "id": "8a7bba56-8cc3-4491-a4e4-ce6e764d4a20",
      "name": "Embeddings Ollama1",
      "credentials": {
        "ollamaApi": {
          "id": "JDEwXMWMP0NDL1Bg",
          "name": "Ollama account"
        }
      }
    },
    {
      "parameters": {
        "model": "qwen3:8b",
        "options": {}
      },
      "type": "@n8n/n8n-nodes-langchain.lmChatOllama",
      "typeVersion": 1,
      "position": [
        656,
        1104
      ],
      "id": "4a9a8bad-0bd2-4928-96d0-a7b001d4f1a6",
      "name": "Ollama Chat Model1",
      "credentials": {
        "ollamaApi": {
          "id": "JDEwXMWMP0NDL1Bg",
          "name": "Ollama account"
        }
      }
    },
    {
      "parameters": {
        "fileSelector": "/home/ahanaf/.n8n-files/rag/*.json",
        "options": {}
      },
      "type": "n8n-nodes-base.readWriteFile",
      "typeVersion": 1.1,
      "position": [
        176,
        64
      ],
      "id": "f31a5f75-1c4f-4a6e-b1a9-e8bef48eb737",
      "name": "Read/Write Files from Disk"
    }
  ],
  "connections": {
    "When clicking ‘Execute workflow’": {
      "main": [
        [
          {
            "node": "Read/Write Files from Disk",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Embeddings Ollama": {
      "ai_embedding": [
        [
          {
            "node": "Pinecone Vector Store",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "Default Data Loader": {
      "ai_document": [
        [
          {
            "node": "Pinecone Vector Store",
            "type": "ai_document",
            "index": 0
          }
        ]
      ]
    },
    "Recursive Character Text Splitter": {
      "ai_textSplitter": [
        [
          {
            "node": "Default Data Loader",
            "type": "ai_textSplitter",
            "index": 0
          }
        ]
      ]
    },
    "When chat message received": {
      "main": [
        [
          {
            "node": "AI Agent",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Ollama Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "AI Agent",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Simple Memory": {
      "ai_memory": [
        [
          {
            "node": "AI Agent",
            "type": "ai_memory",
            "index": 0
          }
        ]
      ]
    },
    "Answer questions with a vector store": {
      "ai_tool": [
        [
          {
            "node": "AI Agent",
            "type": "ai_tool",
            "index": 0
          }
        ]
      ]
    },
    "Pinecone Vector Store1": {
      "ai_vectorStore": [
        [
          {
            "node": "Answer questions with a vector store",
            "type": "ai_vectorStore",
            "index": 0
          }
        ]
      ]
    },
    "Embeddings Ollama1": {
      "ai_embedding": [
        [
          {
            "node": "Pinecone Vector Store1",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "Ollama Chat Model1": {
      "ai_languageModel": [
        [
          {
            "node": "Answer questions with a vector store",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Read/Write Files from Disk": {
      "main": [
        [
          {
            "node": "Pinecone Vector Store",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  },
  "pinData": {},
  "meta": {
    "templateCredsSetupCompleted": true,
    "instanceId": "5ffa5a57469464faca693a8b86e4c5d213f677d19226b4a2da66f08f9e7e4079"
  }
}
This is what I found:
Qwen3:8B is highly capable for tool use and agentic tasks, particularly for prototypes, internal tools, or local, human-in-the-loop applications. While not as reliable as 30B+ models, it is efficient, supports ReAct-style agents, and works well for structured output (JSON) and RAG applications, making it a strong contender for local, consumer hardware.
Key considerations for Qwen3:8B tool use:
- Reliability: The 8B model can occasionally produce inconsistent outputs, such as mixing text with tool calls or failing to follow formatting instructions, compared to larger models like 30B-A3B.
- Best use cases: It is excellent for straightforward tool-calling and chat scenarios, such as RAG (retrieval-augmented generation) or simple API calls.
- Performance: It runs well on consumer GPUs (8–16 GB VRAM), offering high speed and efficiency.
- Strengths: It is highly proficient at “thinking” tasks, which often lets it figure out how to call tools.
For high-reliability or complex, multi-step production agents, consider the Qwen3-30B-A3B model, which offers better consistency for tool calls.
Yeah, the qwen3:8b model can struggle with RAG consistency compared to OpenAI or Claude — smaller models tend to ignore the vector store context when unsure. Have you tried adjusting your prompt to explicitly ask it to use only the provided context? That sometimes helps. If not, jumping to a 30B model or Claude/GPT would be the more reliable fix.
Hi @ajawad,
The model choice matters a lot here: OpenAI models are usually stronger for RAG-style answering, while a local Ollama model may need more tuning and a better prompt to stay grounded in the retrieved context.
I would also check the generation settings. For local LLMs, parameters like temperature and top_p can make a big difference. A higher temperature can increase hallucinations, so for RAG I would usually keep it low.
Also, your prompt may need improvement. A few-shot prompt can help the model follow the expected answer style, but I would be careful with CoT-style prompting in the system message. In many cases, a clear grounding prompt works better than asking the model to “think step by step” in the final answer.
So my first guesses would be:
- Use a stronger local model if your hardware allows it.
- Lower the temperature (see the sketch after this list).
- Tune top_p and other sampling settings.
- Improve chunking and retrieval quality.
- Make the system prompt stricter about using only retrieved context.
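As a concrete starting point, here is roughly what I mean for the sampling settings in the Ollama Chat Model node (a minimal sketch; I am assuming the options export as "temperature" and "topP" keys, which is how I remember the node, so please verify against your n8n version):
{
  "parameters": {
    "model": "qwen3:8b",
    "options": {
      "temperature": 0.1,
      "topP": 0.9
    }
  },
  "type": "@n8n/n8n-nodes-langchain.lmChatOllama"
}
A low temperature keeps the model close to the retrieved text instead of improvising, which is usually what you want for RAG.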
In short, I do not think the problem is only the vector database. The model, the prompt, and the generation settings all matter.
The issue is likely the model’s inherent capability gap. Qwen3:8B works for RAG, but it needs careful prompt engineering to prioritize the vector store context.
- System prompt clarity: Is your system message explicitly instructing the agent to ONLY answer from the retrieval context? Ollama models will hallucinate if the prompt doesn’t make that clear.
- Retrieval quality: Check whether nomic-embed-text actually returns relevant chunks. Log the raw retrieval results; if the chunks are irrelevant, that’s the real problem.
- Model swap: If Qwen3:8B keeps hallucinating, try Mistral 7B or Llama 2 13B. Even then, strict prompt control beats model size.
The article didn’t account for Ollama’s limitations. Try the prompt fix first; it usually kills the hallucinations.
Hey @ajawad, I’ve dealt with this exact hallucination problem when running Ollama for RAG; I spent a whole weekend figuring it out lol. Here’s what actually fixed it for me:
First thing: check what Pinecone is actually giving back.
Honestly, before touching the model I’d check whether the retrieval is even working right. Drop an Edit Fields node after the vector store and log what chunks come back, or test the Pinecone node on its own like the sketch below. Ask something specific like “What languages does Anas know?” and if the chunks that come back are about projects instead of skills, that’s your real problem and no model will save you.
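A standalone retrieval-test node would look roughly like this (sketching from memory: I believe the Pinecone Vector Store node has a “Get Many” mode that exports as "load" with "prompt" and "topK" fields, but treat those exact keys as assumptions and compare against what your n8n version exports):
{
  "parameters": {
    "mode": "load",
    "pineconeIndex": {
      "__rl": true,
      "value": "n8n-demo",
      "mode": "list",
      "cachedResultName": "n8n-demo"
    },
    "prompt": "What languages does Anas know?",
    "topK": 4,
    "options": {}
  },
  "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
  "typeVersion": 1.3
}
Wire your existing Embeddings Ollama node to it and read the returned chunks with your own eyes. If they’re off-topic, fix retrieval before blaming qwen.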
Chunk size is probably too big.
For something like a resume you want way smaller chunks than the defaults. I’d go with a chunkSize around 300 and an overlap around 50. Resumes are short and structured; each chunk should be roughly one section. Default settings work for long articles, but they’re terrible for short docs.
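In the workflow JSON that would look something like this (again a sketch; I’m fairly sure the fields export as "chunkSize" and "chunkOverlap", but double-check in your version):
{
  "parameters": {
    "chunkSize": 300,
    "chunkOverlap": 50,
    "options": {}
  },
  "type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter"
}
Good news: your Default Data Loader is already on textSplittingMode: custom, so the splitter settings will actually take effect once you set them.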
Turn the temperature way down.
This was the biggest fix for me. In the Ollama Chat Model node, set temperature to 0.1 instead of whatever the default is. High temperature = creative = hallucination city. For RAG you want the model boring and literal.
Make the system prompt meaner.
Your prompt is well written but it’s too nice to the model haha. Put something like this at the VERY TOP, before everything else:
You must ONLY use information from the retrieved context. If the context does not contain the answer, say "That's not covered in my resume." NEVER invent or guess information.
Models follow instructions at the top of the prompt way more than stuff buried in the middle. I learned that the hard way.
Model-wise: qwen3:8b can work.
The other replies are right that bigger models are more reliable, but honestly 8b can handle a resume chatbot fine if you do the fixes above. I’d try chunk size + temperature first before spending time downloading a 30b model.
Hope that helps, let me know how it goes!