[Feature Request] Native Multi-modal (Vision-Language) Vector Navigation Tool for AI Agents

AI Agent Tool

The idea is:

Context & Motivation
With the rise of Vision-Language Models (VLMs) like Qwen-VL, Llama-3.2-Vision, and GPT-4o, AI Agents are no longer limited to text-based RAG. However, the current n8n implementation of Vector Store nodes is primarily optimized for text-to-text similarity.
Standard RAG (Question → Answer) is often too linear for complex brand identity or technical troubleshooting tasks. We need a way for Agents to “navigate” vector spaces using both visual and textual anchors simultaneously, leveraging advanced database features like Qdrant’s Recommendation API (Centroid Search).

My use case:

Use Case Example
“An agent receives a photo of a hardware error and a text prompt about branding. It uses the photo to find similar visual errors in the Knowledge Base while simultaneously applying the brand’s ‘friendly’ tone. It navigates to the intersection of these two semantic points to generate a perfectly aligned troubleshooting post.”

I think it would be beneficial to add this because:

Proposed Feature
An enhancement to the AI Agent node (or a dedicated Multi-modal Vector Tool) that allows for:

- **Multi-modal Input Support:** Natively accepting `image_url` (or base64) alongside text in the agent's tool-calling logic, following the OpenAI Message Schema.
- **Vector Navigation Logic:** Instead of a simple "Search," the tool should allow the Agent to pass a set of Positive and Negative point IDs to find the "semantic centroid" of a concept.
- **Direct Latent Interaction:** The ability for the Agent to "explore" the vector space without requiring an explicit user query for every retrieval, allowing it to move between documentation and visual references as if they were coordinates.
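To make the first point concrete, here is a minimal sketch of the content array the tool would need to emit. The helper name is hypothetical; the payload shape follows the OpenAI Chat Completions multimodal message schema (text part plus a base64 data-URL `image_url` part):

```python
import base64

def build_multimodal_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat message mixing text and an inline image.

    The base64 data-URL form follows the OpenAI Chat Completions `image_url`
    content part; the helper itself is illustrative, not an existing n8n API.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Example: pair a troubleshooting prompt with image bytes from a previous node.
msg = build_multimodal_message("Find similar hardware errors.", b"\x89PNG\r\n")
```

In n8n terms, the binary data from a previous node's `binary` property would be base64-encoded like this before being handed to the embedding model or VLM inside the agentic loop.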
Technical Requirements / Implementation Details
- **API Compatibility:** Compatibility with the OpenAI-style multimodal payload (`{"type": "image_url", …}`).
- **Enhanced Vector Tool Node:**
  - Operation Mode: Maps/Recommend (distinct from Retrieve).
  - Inputs: Positive Points (IDs), Negative Points (IDs), Search Context (Text/Image).
  - Strategy: Integration of the `average_vector` or `best_score` strategy (specifically for Qdrant users).
- **Workflow Integration:** Ability to handle binary data (images) directly from previous nodes and pass it to the embedding model within the agentic loop.
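For intuition on what the Recommend operation would compute: Qdrant's documented `average_vector` strategy derives the search vector as avg(positive) + (avg(positive) − avg(negative)), i.e. the positive centroid pushed away from the negative one. A plain-Python sketch of that arithmetic (the function names are illustrative; in production the node would call Qdrant's Recommendation API rather than reimplement it):

```python
def average(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def recommend_vector(positive: list[list[float]], negative: list[list[float]]) -> list[float]:
    """Search vector per Qdrant's average_vector strategy:
    avg(positive) + (avg(positive) - avg(negative)).
    With no negative examples it degrades to the plain positive centroid."""
    avg_pos = average(positive)
    if not negative:
        return avg_pos
    avg_neg = average(negative)
    return [2 * p - n for p, n in zip(avg_pos, avg_neg)]

# Two "positive" anchor embeddings (e.g. similar visual errors), one negative
# anchor (e.g. an off-brand tone) steering the query away from it.
query = recommend_vector(positive=[[1.0, 0.0], [0.0, 1.0]], negative=[[0.0, 0.0]])
```

The resulting `query` vector would then be searched as usual, which is exactly the "navigate to the intersection of two semantic points" behavior described in the use case above.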

Any resources to support this?

Are you willing to work on this?