I’m building an ad banner generator workflow. The flow:
User sends a prompt → gpt-image-2 generates an original ad image (1024×1024)
User selects banner sizes (300×250, 728×90, 160×600, 320×100, 300×50, 970×250, etc.; 50+ sizes in total)
The model should adapt the original ad into all selected formats and place them on a single black canvas
The problem: gpt-image-2 ignores aspect ratio instructions in the prompt. I’ve tested:
Exact pixel coordinates and dimensions → ignored (50%+ deviation)
Numeric ratios like “3.88:1” → ignored
“CONTRACT”, “NON-NEGOTIABLE” language → ignored
IAB standard names (LEADERBOARD, SKYSCRAPER) → partial improvement
Wireframe reference image (colored rectangles as layout template) → best result, but still 15-70% deviation for extreme ratios like 728×90 (8:1) and 160×600 (0.27:1)
Using OpenRouter /chat/completions with openai/gpt-5.4-image-2, image_config: { aspect_ratio: "16:9", image_size: "2K" }.
What works: moderate ratios (1:1 to 4:1) — 0.6-5% deviation.
What doesn’t: extreme ratios (6:1+, 1:4+) — model “normalizes” them closer to square.
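For anyone reproducing these numbers, here is a minimal sketch of one way to compute the deviation. The post does not define the metric, so this assumes it means the relative error between the produced and the requested width:height ratio; `ratioDeviationPct` is a hypothetical helper name, and the 728×180 result is a made-up measurement.

```javascript
// Assumption: "deviation" = relative error of the width:height ratio.
// The thread does not say how the percentages were measured.
function ratioDeviationPct(target, actual) {
  const want = target.width / target.height;
  const got = actual.width / actual.height;
  return (Math.abs(got - want) / want) * 100;
}

// Hypothetical measurement: a 728x90 slot that came back drawn as 728x180.
const pct = ratioDeviationPct(
  { width: 728, height: 90 },
  { width: 728, height: 180 }
);
console.log(pct.toFixed(1) + "% deviation");
```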
Constraints: Can’t use OpenAI API directly (no key yet). Can’t do separate API call per banner (too expensive with 50 sizes).
Has anyone solved this? I don’t think SD or sharp+canvas is a good idea here.
@Viktor_Skobeliev gpt-image-2 will never nail exact pixel ratios, you’re fighting the model’s architecture — it literally can’t output arbitrary aspect ratios, it normalizes everything toward square. generate the 1024×1024 once then crop+resize programmatically in a Code node, that’s the only way to get 0% deviation on 50+ sizes.
set NODE_FUNCTION_ALLOW_EXTERNAL=sharp in your env and npm install sharp in your n8n directory, swap in your OpenRouter key and your actual sizes array, one API call total.
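A rough sketch of the crop math behind "generate once, then crop+resize" follows. This is not the poster's code, just an illustration under assumptions: `centerCropRect` is a hypothetical helper, it uses a plain center crop, and sharp itself is left out so the snippet runs without dependencies.

```javascript
// Largest rectangle with the target aspect ratio that fits in the source,
// centered. Returns integers, which is what sharp's extract() expects.
function centerCropRect(srcW, srcH, dstW, dstH) {
  const scale = Math.min(srcW / dstW, srcH / dstH);
  const width = Math.min(srcW, Math.round(dstW * scale));
  const height = Math.min(srcH, Math.round(dstH * scale));
  return {
    left: Math.floor((srcW - width) / 2),
    top: Math.floor((srcH - height) / 2),
    width,
    height,
  };
}

// Illustrative subset of the sizes array from the original post.
const sizes = [
  { width: 300, height: 250 },
  { width: 728, height: 90 },
  { width: 160, height: 600 },
];

for (const s of sizes) {
  const rect = centerCropRect(1024, 1024, s.width, s.height);
  // In the Code node this rect would feed something like:
  //   sharp(buffer).extract(rect).resize(s.width, s.height).toBuffer()
  console.log(`${s.width}x${s.height}`, rect);
}
```

The final resize always lands on the exact target pixels, so the aspect ratio error is zero by construction; the open question is only which part of the square survives the crop.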
@Viktor_Skobeliev — yes you are hitting a known limitation, gpt-image-2
(and honestly every diffusion model) collapses extreme aspect ratios
toward the trained distribution which is heavily square-biased. no
amount of prompt engineering fixes that, you’re fighting the architecture.
i saw you ruled out sharp+canvas already, but i’d actually push back on
that one — the cost objection was about “separate API call per banner”
which is a different problem. with sharp you’re making one generation
call, then doing local image manipulation on the 1024×1024 output.
zero extra API cost, deterministic pixel-perfect output, runs in like
200ms for 50 sizes.
the trick isn’t a naive resize though, that distorts faces and text.
what actually works for ad banners is a smart-crop + composite pipeline:
- generate the 1024×1024 hero image once via gpt-image-2
- use a separate gpt call (cheap, no image model) to identify the
“focal area” (main subject, logo position, headline area) and return it as
bounding box coordinates
- for each target size, crop around the focal area with proper aspect
ratio preservation, then resize. for extreme ratios like 728×90, you
extract a horizontal slice through the focal point, for 160×600 a
vertical slice
- for sizes where the focal area can’t fit (e.g. 320×50), composite
the cropped subject onto a brand-color background with auto-extended
edges (sharp’s `extend` with edge sampling does this well)
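The crop step in the list above could look roughly like this. It is a sketch under assumptions: the focal box is taken to be `{ x, y, width, height }` in source pixels (the shape of the GPT response is not specified in the thread), and `focalCropRect` and its `focalFits` flag are hypothetical names.

```javascript
// Crop window with the target aspect ratio, centered on the focal box and
// clamped so it never leaves the source image.
function focalCropRect(src, focal, dst) {
  const scale = Math.min(src.width / dst.width, src.height / dst.height);
  const width = Math.min(src.width, Math.round(dst.width * scale));
  const height = Math.min(src.height, Math.round(dst.height * scale));

  const cx = focal.x + focal.width / 2;
  const cy = focal.y + focal.height / 2;
  const clamp = (v, lo, hi) => Math.max(lo, Math.min(hi, v));

  return {
    left: clamp(Math.round(cx - width / 2), 0, src.width - width),
    top: clamp(Math.round(cy - height / 2), 0, src.height - height),
    width,
    height,
    // false means the focal box is larger than the crop window, i.e. the
    // "can't fit" case that needs the composite-on-background fallback.
    focalFits: focal.width <= width && focal.height <= height,
  };
}

const src = { width: 1024, height: 1024 };
const focal = { x: 600, y: 100, width: 300, height: 200 }; // made-up box
console.log(focalCropRect(src, focal, { width: 300, height: 250 }));
console.log(focalCropRect(src, focal, { width: 728, height: 90 }));
```

Returning `focalFits` instead of throwing keeps the decision about the fallback branch in the workflow rather than in the geometry helper.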
this is essentially how Bannerflow, Smartly, and Adobe Express handle
multi-size adaptation under the hood — they all generate once, adapt
locally.
if you absolutely need a single canvas with all banners laid out,
generate each one separately via sharp, then composite onto a black
canvas using sharp().composite([{input, top, left}, ...]).
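For the single-canvas case, the `top`/`left` values have to come from somewhere. Below is a minimal sketch of a row-based ("shelf") layout; `shelfLayout`, the 16px gap, and the 1100px canvas width are all assumptions for illustration, and sharp is only referenced in comments so the snippet runs on its own.

```javascript
// Places banners left to right and starts a new row when the next one
// would overflow the canvas width. Returns positions plus the final height.
function shelfLayout(sizes, canvasWidth, gap = 16) {
  let x = gap;
  let y = gap;
  let rowHeight = 0;
  const placements = [];

  for (const { width, height } of sizes) {
    // Wrap to a new row unless this is the first item on the row.
    if (x + width + gap > canvasWidth && x > gap) {
      x = gap;
      y += rowHeight + gap;
      rowHeight = 0;
    }
    placements.push({ width, height, left: x, top: y });
    x += width + gap;
    rowHeight = Math.max(rowHeight, height);
  }
  return { placements, canvasHeight: y + rowHeight + gap };
}

const layout = shelfLayout(
  [
    { width: 300, height: 250 },
    { width: 728, height: 90 },
    { width: 160, height: 600 },
    { width: 970, height: 250 },
  ],
  1100
);
console.log(layout);
// With sharp, the result would be used along these lines:
//   sharp({ create: { width: 1100, height: layout.canvasHeight,
//                     channels: 3, background: "#000" } })
//     .composite(layout.placements.map((p, i) =>
//       ({ input: bannerBuffers[i], left: p.left, top: p.top })))
// where bannerBuffers is a hypothetical array of per-size image buffers.
```

The canvas width needs to be at least the widest banner plus two gaps, otherwise that banner will overhang the edge.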
what’s the actual delivery format you need at the end — individual files
per size, or one mosaic image showing all banners together? that
changes the last step quite a bit.
Thanks for the detailed breakdown — this confirms exactly what we found after a day of testing. The wireframe reference image approach got us closest (0.6% deviation on 300×250) but extreme ratios like 160×600 and 728×90 are fundamentally broken.
Your focal-area pipeline makes sense and that’s the direction we’re leaning. Quick question though — for ad banners specifically, a simple crop around the focal point loses the “recomposition” that makes each format feel intentional (text repositioned, elements rearranged).
Have you compared these two approaches in practice:
- Smart crop (GPT-4o finds focal area → Sharp crops per format): cheap, deterministic, but it’s still a crop, not a redesign
- SD img2img (feed the 1024×1024 into Stable Diffusion with ControlNet at each target size, low denoising ~0.3): more expensive, but the model actually recomposes elements for each aspect ratio
For something like 970×250 from a square original, crop gives you a horizontal slice. But SD img2img could actually spread the elements horizontally like a designer would. Is the quality difference worth the extra complexity in your experience?
The final compositing onto a black canvas is trivial either way — just Sharp composite.
yes, but only on the extreme ratios — running SD on everything is where the complexity stops paying off.
what actually works in practice:
- smart crop for 1:2 to 2:1 ratios (300×250, 336×280, 320×100). focal-aware crop with `extend` for breathing room, which looks indistinguishable from a redesign here.
- SD img2img + ControlNet for the extremes (728×90, 160×600, 970×90). this is exactly where crop visibly fails.
one thing though: 0.3 denoising is too low, the model barely moves elements. try 0.35-0.45, that’s the sweet spot for actual horizontal/vertical recomposition.
quick tip: pass your wireframe layout as the ControlNet input alongside the 1024×1024. without it SD scatters elements awkwardly at 8:1.
cost-wise, if ~12 of your 50 sizes are extremes, you’re running SD on 24% of outputs, so roughly 4x fewer SD runs than img2img on everything, with most of the quality gain.
want me to share the n8n routing for this? Switch node by aspect ratio, two parallel branches merging at the composite step. pretty clean once you see it.
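Until the actual routing is shared, the split itself can be sketched in a few lines. This is an assumption about what the Switch node would test, not the poster's workflow: `routeBySize` is a hypothetical helper and the 2:1 cutoff is taken from the ratio range mentioned above.

```javascript
// Splits sizes into a "crop" branch and an "sd" branch by how far the
// aspect ratio is from square, treating wide and tall banners the same way.
function routeBySize(sizes, maxRatio = 2) {
  const branches = { crop: [], sd: [] };
  for (const size of sizes) {
    const ratio = Math.max(
      size.width / size.height,
      size.height / size.width
    );
    branches[ratio <= maxRatio ? "crop" : "sd"].push(size);
  }
  return branches;
}

const sizes = [
  { width: 300, height: 250 },
  { width: 336, height: 280 },
  { width: 320, height: 100 },
  { width: 728, height: 90 },
  { width: 160, height: 600 },
  { width: 970, height: 90 },
];

const routed = routeBySize(sizes);
console.log("crop:", routed.crop.length, "sd:", routed.sd.length);
console.log("share of SD runs:", routed.sd.length / sizes.length);
```

Note that 320×100 is 3.2:1, so a strict 2:1 cutoff sends it to the SD branch. Calling `routeBySize(sizes, 3.5)` keeps it on the crop branch while 160×600 (3.75:1) still goes to SD.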
That hybrid split makes total sense — we were overcomplicating it by trying one approach for everything.
Yes please, would love to see that n8n routing! The Switch node by aspect ratio into two branches is exactly what we need.
Also great tip on 0.35-0.45 denoising and wireframe as ControlNet input — we actually already generate a wireframe in Sharp (colored rectangles on black canvas) so that’s ready to go.
Which SD model are you using for the extreme ratios? And are you running it via API (Replicate/RunPod) or self-hosted?