How to use multimodal models from Ollama (e.g. LLaVA) in n8n?

Hi team,

I’d like to ask whether n8n currently supports integration with multimodal models released via Ollama — particularly models like LLaVA, which can process both image and text inputs.

Use case:

I’m building a workflow in which I want to:

  1. Upload or receive an image (e.g., via HTTP Request or Telegram Trigger).

  2. Send that image + a prompt to a multimodal model (e.g., LLaVA running via Ollama on my local server).

  3. Receive a response from the model (e.g., image caption, object detection, etc.) and use it in the flow.

My questions:

  1. Is there currently a way to call Ollama’s multimodal models directly from n8n?

  2. If not officially supported, is there a suggested workaround using the HTTP Request node to send a request to a local Ollama server (e.g., http://localhost:11434/api/generate) with an image file?

  3. Will there be official support for multimodal AI workflows (image+text input) in the future?
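To make question 2 concrete, the request I'm imagining would look roughly like this. It's just a sketch based on Ollama's documented `/api/generate` endpoint; `llava` is an example model name:

```javascript
// Rough sketch of the request body for Ollama's /api/generate endpoint.
// Field names follow Ollama's API docs; "llava" is an example model name.
function buildGeneratePayload(prompt, imageBuffer) {
  return {
    model: 'llava',
    prompt,
    // Multimodal models take base64-encoded images in the "images" array.
    images: [imageBuffer.toString('base64')],
    stream: false, // single JSON response instead of a token stream
  };
}

// Then POST it, e.g.:
// await fetch('http://localhost:11434/api/generate', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildGeneratePayload('Describe this image.', imageBuffer)),
// });
```

Is this roughly the shape of request the HTTP Request node should send?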

Any guidance or examples would be greatly appreciated!

Thanks for building such a powerful platform :folded_hands:

Best regards,

[Your Name]

Hi there,

While I haven’t built a full example yet, I believe it’s possible to call Ollama’s multimodal models like LLaVA in n8n using the HTTP Request node. You’d need to send a POST request to http://localhost:11434/api/generate, including your prompt and the image encoded in base64.

You can use a Function node in n8n to convert the image to base64 before sending the request.
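As a rough, untested sketch of that Function node (assuming the incoming item carries its image under the binary property `data` — n8n already stores binary payloads base64-encoded, so no extra conversion library is needed):

```javascript
// Sketch of an n8n Function-node body. n8n exposes `items` in scope;
// the logic is wrapped in a function here so it can be read standalone.
function buildOllamaItem(items) {
  // n8n keeps binary payloads base64-encoded under item.binary.<prop>.data,
  // so the image can be passed straight through to Ollama's "images" field.
  const base64Image = items[0].binary.data.data;
  return [{
    json: {
      model: 'llava',          // example model name
      prompt: 'Describe this image.',
      images: [base64Image],   // base64 images for multimodal models
      stream: false,
    },
  }];
}

// In the actual Function node you would end with:
// return buildOllamaItem(items);
```

The resulting JSON item can then be sent as the body of an HTTP Request node pointed at http://localhost:11434/api/generate.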

I’m also interested in this use case and happy to collaborate or test ideas if helpful!

Best regards,

n8n enthusiast - kz

Yes, it works with an HTTP Request node. I've built a simple workflow that checks whether animals are present in my wildlife-camera images (the camera sometimes triggers even when no animal is in frame).
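For anyone recreating the idea, the downstream check can be as simple as this sketch. The prompt wording is illustrative, and the `response` field assumes a non-streaming `/api/generate` call:

```javascript
// Sketch of the yes/no check after the HTTP Request node. With a prompt
// like "Is there an animal in this image? Answer yes or no.", a
// non-streaming /api/generate call returns { response: "...", ... },
// and we just normalize the model's answer.
function hasAnimal(ollamaResponse) {
  return /^\s*yes/i.test(ollamaResponse.response);
}
```

An IF node can then branch on that boolean to keep or discard the image.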

This is my workflow (with German notes):