Hi team,
I’d like to ask whether n8n currently supports integration with multimodal models released via Ollama — particularly models like LLaVA, which can process both image and text inputs.
Use case:
I’m building a workflow in which I want to:
- Upload or receive an image (e.g., via an HTTP Request or Telegram Trigger).
- Send that image plus a prompt to a multimodal model (e.g., LLaVA running via Ollama on my local server).
- Receive the model's response (e.g., an image caption, object detection results) and use it later in the flow.
My questions:
- Is there currently a way to call Ollama's multimodal models directly from n8n?
- If not officially supported, is there a suggested workaround using the HTTP Request node to send a request to a local Ollama server (e.g., http://localhost:11434/api/generate) with an image file?
- Will there be official support for multimodal AI workflows (image + text input) in the future?
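For reference, here is a minimal sketch of the workaround I have in mind: building the JSON body that an HTTP Request node would POST to Ollama's /api/generate endpoint, with the image base64-encoded in the `images` field. The model name `llava` and the localhost URL are just assumptions about a typical local setup.

```python
import base64
import json

def build_ollama_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> str:
    """Build the JSON body for POST http://localhost:11434/api/generate."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Ollama's generate endpoint accepts base64-encoded images
        # for multimodal models in this array.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    })

# Example: in n8n, the same structure could go into the HTTP Request
# node's JSON body, with the binary image data encoded via an expression
# or a preceding Code node. The image bytes here are a stand-in.
payload = build_ollama_payload(b"\x89PNG...", "Describe this image.")
```

Is a structure like this the recommended way to pass binary image data from an n8n workflow to Ollama, or is there a more idiomatic option?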
Any guidance or examples would be greatly appreciated!
Thanks for building such a powerful platform!
Best regards,
[Your Name]