N8n + Ollama + Thinking + MCP servers = very bad experience

Just to set the context of this. I am a extremely technical person. I run my own lab at my house. I have n8n running in a docker on my TrueNAS Scale server. I am using Ollama for inference locally. I run other AI tools without issues such as Opencode and Open WebUI.

I started using n8n to learn it. I have a pretty good understand for working in it for 3 days and I just want to see if this is a me issue or if n8n just has very poor Ollama support.

I have been working on what I would consider a very simple workflow compared to the langflow workflows I have setup. Its a simple Discord bot I setup on my lab discord server. I have a role setup to assign to people so they can query it. The goal is just an experiment to see how quickly I could get something in place.

I have been having, what I would consider, massive stability issues with n8n.

1: Multiple executions every time I publish a change to my workflow.

2: Really really bad Ollama support. I have tried around 10 different models and very few work properly with the “AI Agent” backed with a Ollama model. I have multiple MCP servers connected as tools via MetaMCP and 9 time out of 10, n8n does not manage the tool calling correctly and just dumps the tool calling into text and pushing the workflow forward.

3: Poor auditing. A number of times I will have failed workflow executions with no data in the nodes I can review to figure out what happened.

4: Lack of support for parallel execution unless you use sub-workflows. I guess this is fine but I consider this to be a work around rather then the correct way to handle this.

In the end, I just get the feeling n8n would not fit well into a production environment if you are using Ollama as an inference source. Is it me or is this what others feel as well? As a so-called “AI Workflow Engine” I feel the software is lacking in basic key features and stability. As it sits, I could never recommend this to any of the companies that I do business with due to these issues I ran into.

Am I alone in this assessment?

1 Like

What are the models? Do they support tool use?

Did you declare any tooling schemas?

Did you declare any error workflow?

You seem to be using a hammer and treating every issue as a “nail“.

1 Like

1: Yes, every model supports tool calling. The model set that worked the best have been qwen3.5 4b+ models but I tried qwen3.6, Deekseek-R1, gemma4 and glm-4.7-flash of various sizes. They all have the issue I described.

2: I’m using the “Ai Agent” with a Tool node attached to it. The JSON tool schema is being produced by the model and pushed as text instead of being picked up as a Tool call by the Tool node. If n8n has a specific Tool Schema I need to direct the model to use, I’m not aware of it.

3: Why would I need an error workflow if the data should show under the nodes when an execution fails? I would think that’s more for error reporting via a ticketing system but if no data is in the nodes, why would I expect that data to be picked up by a error workflow?

I have no idea why you would think the hammer analogy would apply to my post. Admittedly, I’m new to n8n but I do not think I’m miss-using the nodes or some how doing something outside the norm of what n8n should be able to do. Are you saying I’m using n8n in a way it wasn’t ment to be used?

If you think I’m wrong please enlighten me. I’m here to learn.

2 Likes

My next question is: Do they support tool calling reliably?

Some models do not support tool calling reliably

ChatGPT and Claude models are not recommended for no reason. They just work

So does this tool schema conform to the “OpenAI format”?

I had to learn this through the hard way.

This is more for debugging purpose.

You are not mis-using, you are just having a different expectation.

n8n is very powerful and very flexible. As a result, you have to find a proper way to harness it.

I am still learning along the way right now

Ah…… gemma4(google model) are known for unreliable tool calls due to non-conformance of their tool call declaration. You should be able resolve this issue by building a LiteLLM proxy around it

You need a bigger model to make better tool calls

1 Like

But like, I get it, maybe it is a family of models issue. I thought of that. That’s why I tried so many different ones. I’m open to recommending a model that ollama supports. The one I found to be reliable is qwen3.5:4b+ but those models seem to be fairly slow

The scope of my lab is to not use cloud based services. Any cloud based service defeats the point of what I’m trying to do.

I do have to point out that MCP calling does work fine in opencode and in open webui with some of the models that didn’t seem to work in n8n. I have a hard time blaming the model in this case but I am open to the “out-of-spec” issue as thats fairly common in engineering and I do deal with this quite a bit.

If you have a recommendation for a model that runs on a Tesla V100 32gb card that would be fast and works better with tool calling in n8n. I’d be very happy to give it a try.

As far as needing a bigger model for calling tools? That’s one of the models that been 100% reliable in calling tools with-in n8n, so I do not think the size of the model is playing a part in the issue.

2 Likes

You can try gemma4:e4b or gemma4:e2b but you will need LiteLLM to “improve“ the tool calls

1 Like

Hi everyone :waving_hand: I’m really happy :grinning_face: to join this conversation, and thank you both (@Christopher_Powell , @kjooleng ) for sharing your insights.

I think the main issue here is not only “does the model support tools?”, but also whether tool calling is reliable, whether the schema conforms properly, and how easy it is to debug and audit when something goes wrong.

In my experience, architecture makes a big difference too. (1) Prompt engineering, (2) context engineering, and (3) model parameters can change the result a lot, and sometimes (4) a planning first approach can help. For example, separating planning and execution more clearly, (5) “ReWOO” style, can sometimes reduce tool call confusion and make the flow easier to reason about.

I also think (6) hardware :desktop_computer: should not be overlooked here.

When we run LLMs locally, the result is not only about the model or the prompt. CPU, RAM, VRAM, disk speed, and even overall system load can change latency, stability, and tool calling behavior quite a lot. In local development, the same workflow can feel very different on different machines.

That is one more reason I like local testing: it helps us see the full system behavior, not only the model response.

I also find that local environments are very useful for understanding these systems in depth. At home, I like to test locally first so I can see how the model, tools, and workflow behave in a controlled setup before thinking about anything more complex.

Thanks again for the discussion; I’m learning a lot from it :blush:.

1 Like