[Feedback Wanted] Brand-Aware AI Image Generation Agent – System Prompts, Iteration Logic, Multi-Image Handling

What I’m Building

I’m building a conversational AI agent for creative professionals (starting with surface designers). Core goals:

  • Understand who the user is (brand, style, use cases)

  • Understand what they’re making (project goal, resolution, aspect ratio)

  • Generate images conversationally — no prompt engineering

  • Iterate naturally — “make it more vibrant” uses previous image

  • Adapt output based on user role (pattern vs mockup)

Architecture

Stack: n8n, Google Gemini, AWS S3, MongoDB

Main Workflow: Chat Trigger → Load Project Settings → Load Long-term Memory → AI Agent → Image Tool → Save to MongoDB

Image Sub-workflow:

  • Receives: userId, projectId, projectSettings, userSettings, content

  • Enriches prompt with brand context

  • Calls Gemini 3.1 Flash Image Preview

  • Uploads to S3 → Saves to MongoDB

What I Need Feedback On

1. System Message (Main Workflow)

# ROLE
You are a creative partner for {{ $('Load Long-term Memory').item.json.name }}.
Keep responses short and conversational unless the user asks for more.

# CONTEXT

## USER SETTINGS
Use the following data to tailor tone, preferences, and decisions:
{{ $('Load Long-term Memory').item.json.userSettings.toJsonString() }}

## PROJECT SETTINGS
Align all outputs with the current project’s goals, style, and constraints:
{{ $('Load Project Settings').item.json.projectSettings.toJsonString() }}

# TOOL

## ImageTool
Call ImageTool whenever the user requests anything visual. Do not ask for confirmation.

When calling, content.prompt must be a complete brief — synthesize their request
with their brand, style, and project goal. Never pass raw user words alone.

New image → content.prompt only.
Iteration → content.input_image_s3_key from the last tool result in memory
+ content.prompt describing what to change and what to preserve.

After ImageTool returns, reply in 1-2 sentences and offer one next step.

Question: Is this too much instruction? Too little? How do you balance guidance without hardcoding behavior?

  2. Iteration Logic

Current flow:

  1. User: “make it more vibrant”

  2. Agent finds last assistant message with attachment in short-term memory

  3. Extracts s3_key → calls ImageTool with content.input_image_s3_key
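
As a rough illustration, the lookup in step 2–3 could be sketched like this (the message shape with `role` and `attachment.s3_key` is a hypothetical memory format, not n8n's actual structure):

```javascript
// Sketch of the current iteration lookup, assuming a hypothetical
// short-term memory shape: an array of { role, attachment } messages
// where generated images carry an attachment with an s3_key.
function findLastImageKey(messages) {
  // Walk history from newest to oldest and return the first
  // assistant message that carries a generated-image attachment.
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg.role === "assistant" && msg.attachment?.s3_key) {
      return msg.attachment.s3_key;
    }
  }
  return null; // no prior image: the agent should generate, not edit
}
```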

Question: Is this the right pattern? How do you handle “use this uploaded image AND make it like that previous one” (multiple references)?

  3. User Role Adaptation

I want different outputs based on user role:

  • Surface designer → flat patterns, no product mockups

  • Marketer → lifestyle images, mockups

Currently handled in the sub-workflow’s prompt enrichment.

Question: Should this logic live in the system message or the tool? Where’s the right place?

4. Multi-Image Support (Not yet implemented)

Question: How should multiple user-uploaded images be passed to the tool? Should the sub-workflow support multiple reference images? How does the agent decide primary reference?

5. Hardcoded vs Dynamic

A core principle from my team: “We don’t want hard-coded instructions. This is not programming.”

But I find myself adding more structure for consistency.

Question: Where’s the line between “guidance” and “hardcoding”? How do you strike the balance?

Sub Workflow

Thank you so much! I'm just confused: if we keep prompt building at the tool level, then we need to pass each and every thing, right? User context, brand awareness... and if the user wants a modification of a generated image, how will the tool change the prompt when it might not have short-term memory context?

Could you help me with the AI agent prompt and the tool-level prompt, please?

You’re asking the right question — this is exactly where most people get stuck.

The confusion usually comes from mixing memory responsibility between the agent and the tool.


Key idea

The tool should NOT rely on memory
The agent should prepare everything the tool needs

So yes — the tool needs full context, but not by “remembering”, only by receiving structured input.


About iteration (your main concern)

Right now you’re thinking:

“If tool builds the prompt, how does it know previous image / context?”

Answer:
It doesn’t — the agent must pass that explicitly


Recommended pattern

Instead of this:

content: {
  prompt: "...",
  input_image_s3_key: "..."
}

Make it slightly more structured:

content: {
  intent: "edit" | "generate",
  prompt: "...",
  reference_images: ["s3_key_1", "s3_key_2"],
  primary_reference: "s3_key_1"
}

Where memory should live

  • Short-term memory (chat history) → used by the agent

  • Tool → stateless (just executes)

So iteration flow becomes:

  1. Agent reads memory

  2. Finds last image (or relevant one)

  3. Passes it into content.reference_images

  4. Tool builds final prompt from that


Why your current approach feels fragile

“find last assistant message → extract s3_key”

This breaks when:

  • multiple images exist

  • user refers to older images

  • or combines ideas


Simple improvement (no big redesign)

Instead of parsing chat history each time:

store this in memory:

last_generated_image
image_history: []

Then the agent can reliably choose what to send to the tool.
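
The memory update could look something like this (the field names `last_generated_image` and `image_history`, and the 5-entry cap, are illustrative assumptions):

```javascript
// Sketch of the structured-memory update. Run after every successful
// generation, before the agent's next turn, so the agent never has to
// parse chat history to find an image.
function recordGeneratedImage(memory, s3Key, maxHistory = 5) {
  const history = memory.image_history ?? [];
  return {
    ...memory,
    last_generated_image: s3Key,
    // Keep only the most recent entries so memory stays small.
    image_history: [...history, s3Key].slice(-maxHistory),
  };
}
```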


About prompt building location

You’re right to question this.

Best balance I’ve found:

  • Agent → decides what the user wants

  • Tool → decides how to construct the final prompt

So:

  • Agent sends structured intent + context

  • Tool enriches it (brand, style, formatting)


Summary

  • Tool doesn’t need memory → just clean inputs

  • Agent handles memory + selection

  • Iteration works by passing references explicitly

  • Avoid relying on “last message” parsing


If you want, I can help you define a clean content schema that scales for multi-image + iteration


Your iteration pattern (agent reads memory → finds last image → passes to tool) is solid. An alternative worth trying: instead of storing only last_image_s3_key, keep a small reference_stack of the last 3-5 generated images with timestamps. Then your agent can handle “use this one like that other one” without parsing chat history.

For multi-image: structure content.reference_images as an array and let the tool decide the primary based on order or metadata tags. That keeps the agent logic clean and the tool still stateless.
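The order-based fallback can be sketched as a small stateless helper (field names follow the content schema discussed in this thread and are assumptions):

```javascript
// Sketch of tool-side primary selection: prefer an explicit
// primary_reference, otherwise fall back to array order.
function pickPrimaryReference(content) {
  const refs = content.reference_images ?? [];
  if (refs.length === 0) return null;
  if (content.primary_reference && refs.includes(content.primary_reference)) {
    return content.primary_reference;
  }
  return refs[0]; // first image in the array is the default primary
}
```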

Your prompt-location split (agent decides what, tool decides how) is exactly right — it avoids hardcoding brand rules in the system message.

I would like that

Nice — let’s make this concrete so you can actually use it in your workflow

The goal is simple:
the agent sends structured intent + references
the tool builds the final prompt


1. Recommended content schema

Instead of only prompt + input_image, use this:

content: {
  intent: "generate" | "edit",
  prompt: "short user intent (not final prompt)",
  reference_images: ["s3_key_1", "s3_key_2"],
  primary_reference: "s3_key_1"
}

Why this helps

  • intent → tells the tool what to do (new vs edit)

  • reference_images → supports multiple images (future-proof)

  • primary_reference → avoids ambiguity


2. What the Agent should do

In your current setup (AI Agent node):

Instead of trying to fully construct the final prompt, let it:

  1. Read short-term memory

  2. Pick the relevant image(s)

  3. Send structured data like:

{
  "intent": "edit",
  "prompt": "make it more vibrant but keep the composition",
  "reference_images": ["last_s3_key"],
  "primary_reference": "last_s3_key"
}

No need to inject brand/style here


3. What the Tool (sub-workflow) should do

Inside your image_tool (Build API Prompt node):

This is where you:

  • merge:

    • content.prompt

    • userSettings.brand

    • projectSettings.goal

Example:

const basePrompt = content.prompt;

const brand = userSettings?.brand?.style?.color_style || '';
const goal = projectSettings?.goal || '';

const finalPrompt = [
  basePrompt,
  `Style: ${brand}`,
  `Goal: ${goal}`,
  'Print-ready, clean edges'
].filter(Boolean).join('\n');

This keeps the tool responsible for prompt quality
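
Because the tool is stateless, it can also validate its inputs before building anything. A minimal sketch, assuming the schema above (all field names are from this thread's proposal, not a fixed spec):

```javascript
// Stateless input validation at the top of the sub-workflow.
// Returns a list of problems; an empty list means the request
// is safe to execute.
function validateContent(content) {
  const errors = [];
  if (!["generate", "edit"].includes(content.intent)) {
    errors.push(`unknown intent: ${content.intent}`);
  }
  if (!content.prompt || !content.prompt.trim()) {
    errors.push("prompt is required");
  }
  const refs = content.reference_images ?? [];
  if (content.intent === "edit" && refs.length === 0) {
    errors.push("edit requires at least one reference image");
  }
  if (content.primary_reference && !refs.includes(content.primary_reference)) {
    errors.push("primary_reference must be one of reference_images");
  }
  return errors;
}
```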


4. Handling iteration reliably

Right now you’re doing:

“find last assistant message”

Instead, store this in memory:

  • last_generated_image

  • image_history: []

Then the agent can directly use:

reference_images: [memory.last_generated_image]

No need to parse chat messages anymore


5. Minimal change (no redesign)

If you don’t want to refactor everything:

  • keep your current flow

  • just:

    1. change content structure

    2. move prompt enrichment into Build API Prompt

That alone will already make iteration much more stable.


If you want next step, I can help you:

map this directly to your existing nodes (AI Agent + image_tool)
so you don’t have to guess where each piece goes.


Thank you - I’ll make these changes, and yes, I would like your help 🙂

@erwin_burhanudin My use case is mainly for artists. They will either upload an image or share a URL, and they’ll want to create mockups of their designs on products.

So in this case, how should we handle it? What should we pass in project settings, user settings, and prompts?

Since it’s conversational, for example: if someone wants a cushion cover mockup, they might say “indoor setting” and share their designed product. We need to use that exact design on the cushion product and generate the image.

But should we also ask them things like lighting preferences, models (with or without), different backgrounds or assets?

Basically, how should we structure this whole flow? I am really getting confused in that area as well

For the mockup use case with artists, the key difference from pure generation is that the uploaded design is a fixed asset — the model needs to place it on the product, not reinterpret it.

Practical split for your architecture:

userSettings (persistent — who the artist is):

  • brand style, color preferences, recurring constraints (“no humans unless requested”)
  • default output format

projectSettings (per session — what they’re making):

  • product type (cushion cover, tote bag, t-shirt…)
  • aspect ratio / resolution
  • base lighting preference if set upfront

content (per generation — what to do right now):

{
  "intent": "mockup",
  "prompt": "cushion cover on a linen sofa, warm indoor light, lifestyle feel",
  "reference_images": ["design_upload_s3_key"],
  "primary_reference": "design_upload_s3_key"
}

The prompt in content describes the scene (product + environment). The design itself comes from the reference image — keep those concerns separate.

One thing to add inside Build API Prompt in your sub-workflow:

const preserveInstruction = "Apply the reference design faithfully to the product. Do not modify, stylize, or reinterpret the design pattern.";

That single line makes a measurable difference in how consistently Gemini handles placement vs. remixing.

For conversational onboarding — two questions is enough: what product, and do they have a design to upload. Lighting, model presence, backgrounds are all iterative. Let the conversation surface them naturally rather than front-loading a configuration form.

You’re actually very close — the confusion you’re feeling is normal at this stage, because you’re no longer dealing with a workflow problem, but a system design problem.

The main thing to clarify is separation of responsibility:

  • The agent should decide what the user wants

  • The tool should decide how to execute it

  • Memory should only store state, not logic


1. Where should things live (user vs project vs prompt)

A simple way to think about it:

  • User settings → who they are (brand, style, preferences)

  • Project settings → what they’re trying to create (mockup, pattern, resolution, aspect ratio)

  • Prompt (content.prompt) → what they want right now in this step

So in your example:

“cushion mockup in indoor setting”

  • “cushion mockup” → project-level intent

  • “indoor setting” → prompt-level instruction

  • brand style → always comes from user settings (not repeated every time)


2. Should you ask users about lighting, models, etc?

Don’t ask everything upfront.

A better pattern is:

  • Generate a strong default result first

  • Then refine through conversation

Example:
User: “create cushion mockup”
→ generate with a good default (soft indoor lighting, clean composition)

Then:
“Want it brighter, more lifestyle, or keep it minimal?”

This keeps it conversational instead of form-based.
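
One way to implement "strong defaults first" is to fill in scene details only when the user's prompt doesn't already mention them. A sketch under stated assumptions (the keyword lists and default phrases here are illustrative, not a fixed spec):

```javascript
// Append scene defaults only where the user's prompt is silent,
// so explicit requests always win over defaults.
function applyMockupDefaults(userPrompt) {
  const parts = [userPrompt];
  const lower = userPrompt.toLowerCase();
  // Only add a lighting default if the user said nothing about light.
  if (!/light|bright|dark|sun/.test(lower)) {
    parts.push("soft indoor lighting");
  }
  // Only add a composition default if none was requested.
  if (!/composition|minimal|busy|clutter/.test(lower)) {
    parts.push("clean, uncluttered composition");
  }
  return parts.join(", ");
}
```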


3. Handling uploaded images (this is the key part)

You need to separate types of images:

  • Design image (user upload) → must be preserved exactly

  • Context / style images → can influence composition

So instead of only:
content.input_image_s3_key

You’ll eventually want something like:

content: {
  intent: "mockup",
  prompt: "place this design on a cushion in a warm indoor setting",
  design_image: "s3_user_upload",
  reference_images: [],
  primary_reference: null
}

This avoids confusion when users say things like:
“use this design but make it like the previous one”


4. Iteration logic

Right now you’re parsing chat history — that works short-term, but will break later.

A more stable approach:
store structured memory like:

  • last_generated_image

  • image_history

Then the agent can explicitly choose what to send:
no guessing from chat messages.


5. Hardcoding vs guidance

You’re right — this shouldn’t feel like programming.

The balance I’ve found works best:

  • System message → defines behavior boundaries (when to call tool, what structure to follow)

  • Tool → handles prompt construction (brand, style, formatting)

  • Agent → stays flexible and conversational

So you’re not hardcoding outcomes — you’re defining interfaces.


If you want, next step I can help you map this directly to your current n8n nodes (AI Agent + sub-workflow), so it’s not just conceptual but fully implementable.


Thanks for the breakdown — the separation of responsibility between agent, tool, and memory is what we needed to grasp. Gonna try implementing this next week; might take you up on that mapping help if we get stuck.


I would really like that, @erwin_burhanudin - this is exactly the approach I want. I’m just overwhelmed by how and when to use what, and how to keep it conversational.

I am thinking of adding these Project and User Settings nodes, and in the chat node I enabled Allow uploads.

@erwin_burhanudin - can you please help me with the AI Agent setup (prompt), tool-call updates, and tool prompts?

And if the user uploads images, how do we process them first - should we dump them to S3 first and pass the s3_key via the agent node?

I would really like that 🙂

You’re actually very close — the main confusion here is just where each responsibility should live.

A simple way to structure it:

  • Agent → decides what the user wants

  • Tool → decides how to execute it

  • Memory → stores state (not logic)

For your case (mockups + uploads), a clean pattern would be:

  • Upload image → store in S3 → pass s3_key

  • Agent sends structured input (not full prompt), e.g.:

content: {
  intent: "mockup",
  prompt: "cushion in indoor setting",
  design_image: "s3_key"
}
  • Tool (sub-workflow) then builds the final prompt using:

  • userSettings (brand/style)

  • projectSettings (goal/output)

  • content.prompt

This keeps things flexible and avoids memory issues during iteration.
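
Inside the sub-workflow, the mockup prompt build could combine those three sources plus the design-preservation line mentioned earlier in the thread. A minimal sketch, assuming hypothetical field names (`brand`, `goal`, `design_image`):

```javascript
// Merge the agent's scene prompt with settings, and pin the uploaded
// design so the model places it rather than reinterpreting it.
function buildMockupPrompt(content, userSettings, projectSettings) {
  const lines = [
    content.prompt, // scene description from the agent
    userSettings?.brand ? `Brand style: ${userSettings.brand}` : "",
    projectSettings?.goal ? `Goal: ${projectSettings.goal}` : "",
  ];
  if (content.design_image) {
    lines.push(
      "Apply the reference design faithfully to the product. " +
      "Do not modify, stylize, or reinterpret the design pattern."
    );
  }
  return lines.filter(Boolean).join("\n");
}
```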


For iteration:
Instead of parsing chat history, try storing:

  • last_generated_image

  • image_history

Then the agent can pass references explicitly (much more stable).


If you want, I can help you map this directly into your current nodes (AI Agent + sub-workflow) so it’s easier to implement

Appreciate you writing it out in one place — having the agent/tool/memory split mapped clearly like that helps. Will keep the content schema pattern in mind when we build it out.