Node: Gemini multimodal GenAI (Vertex AI)

It would help if there was a node for:

Google Gemini multimodal (Vertex AI)

My use case:

TL;DR: Cheaper, seemingly faster, and possibly better than OpenAI’s Analyze Image node.

n8n already has a node for OpenAI’s GPT-4 Vision API, called “OpenAI - Analyze Image”. It was released recently, possibly following a request in Please add support of the new OpenAI features [done] - #26 by tomtom

I did a few comparisons between OpenAI and Google on the same multimodal use case: image + prompt. Gemini didn’t do badly at all. It seems faster (comparing the Google console with the n8n node, so not really a fair comparison) and gave more creative results (my impression only, not a fair comparison either).
The biggest difference is pricing: for an image + prompt, Gemini is 4x cheaper (based on an image of roughly 600x600). Google’s pricing is flat per image, while OpenAI’s pricing is proportional to the image size.

I therefore think that a Google-based node could be more popular than the OpenAI-based one. The UI and parameters for the Google node (prompt + image URL) could be the same as for the OpenAI node.

Any resources to support this?

Vertex AI has a sandbox in the Google Cloud console
API docs are at Google Cloud console
Pricing at Pricing | Generative AI on Vertex AI | Google Cloud

I understood that Vertex AI is the name for the GenAI multimodal API; PaLM only does text inputs and outputs. The model I was able to test inside Vertex AI is called “gemini-1.0-pro-vision-001”.
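For anyone prototyping this, the Vertex AI generateContent request pairs a text part with an inline, base64-encoded image part. Here is a minimal Python sketch of just the request body (the endpoint URL, project ID, and region shown in the comment are placeholders you’d fill in; this only builds the payload, it doesn’t authenticate or send anything):

```python
import base64
import json

def build_vertex_payload(prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a generateContent request body: one user turn containing
    a text part plus an inline (base64-encoded) image part."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {
                        "inlineData": {
                            "mimeType": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ],
            }
        ]
    }

# The body would be POSTed (with an OAuth bearer token) to something like:
# https://{region}-aiplatform.googleapis.com/v1/projects/{project}/locations/{region}
#     /publishers/google/models/gemini-1.0-pro-vision-001:generateContent

payload = build_vertex_payload("Describe this image", b"\x89PNG...", "image/png")
print(json.dumps(payload)[:80])
```

A node’s “prompt + image URL” parameters would map directly onto the two parts above.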

I decline any responsibility for the fact that Google could, at any time, rename their models and products in the most confusing way possible :slight_smile:

Are you willing to work on this?

I can make a fork in my workflow and help test the requested node against the currently available OpenAI Analyze Image node.

Hello, has anyone looked at this yet?

I didn’t hear back after that request. I guess it needs enough upvotes to be considered?

I found out that n8n must be aware of it, since there’s a landing page optimized for Gemini and Vertex AI keywords, but seemingly nothing concrete behind it: Google Vertex AI integrations | Workflow automation with n8n

This would be an amazing feature. We can’t use any multimodal functionality within an AI Agent; it would be great to be able to pass an audio file directly to Gemini without having to use Whisper first, for example.

I’d second this, and would have thought the existing Gemini node could be tweaked to allow non-image binaries to be passed through, since the same functionality (submitting audio/video) can be achieved with the HTTP node, albeit less elegantly!

There’s a wealth of potential in Gemini’s multimodal abilities: it can analyse music, for example, providing info on structure, influences, etc., as well as simply transcribing words, but I currently have to string code and HTTP nodes together to achieve this.

Just to add, in case it helps anyone willing and able to develop this: the multimodal capabilities can be achieved through the HTTP node, as below (using a form-submission prompt and the Gemini API rather than Vertex, but the same principles apply). Having these capabilities integrated into the Gemini / Vertex nodes would be a game changer though - I’m not aware of any other models that can work across file types the way Gemini 2+ can.
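To make the HTTP-node approach concrete, here is a rough Python sketch of the equivalent Gemini API call for an audio file. The model name is a placeholder, the API key is read from an environment variable, and this only constructs the URL and body (the HTTP node would do the actual POST):

```python
import base64
import os

# Placeholder model name; substitute whichever multimodal model you have access to.
MODEL = "gemini-1.5-flash"

def build_audio_request(prompt: str, audio_bytes: bytes, mime_type: str = "audio/mp3"):
    """Return (url, body) for a Gemini API generateContent call that sends
    an audio clip inline alongside a text prompt - the same JSON shape
    the HTTP node posts."""
    api_key = os.environ.get("GEMINI_API_KEY", "")
    url = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        f"{MODEL}:generateContent?key={api_key}"
    )
    body = {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }
    return url, body

url, body = build_audio_request("Describe the structure and influences of this track", b"ID3...")
```

Swapping the audio part for video or image parts (with the matching MIME type) is the “works across file types” behaviour described above.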
