Multimodal Support in Gemini Chat Node

,

The idea is:

Incoprorate multimodal file support in the core Gemini Chat Node. At present, only images are accepted, despite Gemini treating (nearly*) all its supported file types the same when processing them. I’ve managed to achieve the same functionality using chained HTTP POST requests - one to upload an array of files and the second to submit to Gemini via the ‘Parts’ element (see below), but this only really works well when using a form-based context - having this available natively in the Gemini Chat node would be immensely useful in various contexts.

@felixvemmer has provided a multimodal Gemini node as part of his vercel-sdk community node which works brilliantly in some contexts (and usefully provides the search grounding feature), but doesn’t work in a chat context and also doesn’t support arrays of files, which is where Gemini really shines.

  • For video-based media, the status of the media has to be polled to ensure its ‘state’ is ‘ACTIVE’ prior to submitting to Gemini, otherwise Gemini just errors out.

My use case:

I’d like users to be able to upload various files, e.g. a couple of PDFs, and Image, a video and an Audio file, to the Chat node, and the Gemini node be able to submit these to the Gemini model for processing as a batch.

I think it would be beneficial to add this because:

As far as I can tell, no other models offer quite the same level of multimodality as Gemini 2+. Gemini can process PDFs, including images and diagrams they contain, in addition to music audio (not just voices) and video. There’s huge potential there for building complex workflows within a chat context in n8n.

Any resources to support this?

Here’s the relevant excerpt from the Form-triggered workflow I’m currently using, to achieve the same outcome:

Are you willing to work on this?

I don’t have the programming skills, but happy to speak to the requirements and use-case if further clarity is needed.