This article builds on "What Is an MCP Server?", where I explain the basics.
AI image generators are great for abstract heroes – but the moment a concrete logo has to appear, it gets messy: the model invents a logo that only resembles the real one. For a product hero meant to show the actual Home Assistant logo, that's useless. The fix: send the logo along as a reference image instead of merely describing it.
That's exactly what I retrofitted into my existing generate_ai_image MCP tool. Here I show how image input works with both providers, how the tool automatically switches to the right model – and which pitfall cost me a few wasted generations.
Why text-to-image isn't enough
The classic path is pure text-to-image: prompt in, image out (DALL·E 3, Imagen). But the model has no template – it can only approximate a brand logo from memory. The result: warped letters, wrong proportions, a “logo” that isn't one. For brand assets that's a no-go.
What we need is an image input: we hand the model the real logo and say “build this cleanly into the scene.” Both major providers can do it – just via different endpoints.
Two paths: images.edit vs. generate_content
With OpenAI, image input doesn't go through images.generate but through images.edit with the gpt-image-1 model. You pass one or more input images plus a prompt. Important: the bytes must arrive as file-like objects with a set .name, otherwise the SDK complains.
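The `.name` requirement can be isolated in a small helper. The sketch below is not part of the original tool: `sniff_extension` and `to_named_file` are hypothetical names, and it assumes that an extension matching the real image format is what the upload needs to be labeled correctly. Because references can also come from arbitrary URLs, it checks the leading bytes for PNG, JPEG, and WebP instead of always writing `.png`.

```python
from io import BytesIO

# Leading bytes ("magic numbers") of the formats this sketch recognizes.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
)


def sniff_extension(data: bytes) -> str:
    """Guess a file extension from the leading bytes, defaulting to png."""
    for magic, ext in _SIGNATURES:
        if data.startswith(magic):
            return ext
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return "png"  # assumption: fall back to the name used in the article's code


def to_named_file(data: bytes, index: int) -> BytesIO:
    """Wrap raw bytes in a BytesIO whose .name carries a matching extension."""
    bio = BytesIO(data)
    bio.name = f"reference_{index}.{sniff_extension(data)}"
    return bio
```

In the tool's OpenAI branch, a call like `to_named_file(data, i)` could stand in for the lines that create and name each `BytesIO`.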

    # Reference-image mode: gpt-image-1 image edit with one or more input images
    if reference_images:
        ref_model = model if str(model).startswith("gpt-image") else GPT_IMAGE_REF_MODEL
        size = GPT_IMAGE_SIZE.get(aspect_ratio, "1024x1024")
        inputs = []
        for i, data in enumerate(reference_images):
            bio = BytesIO(data)
            bio.name = f"reference_{i}.png"  # the SDK needs a file name
            inputs.append(bio)
        response = client.images.edit(
            model=ref_model,
            image=inputs if len(inputs) > 1 else inputs[0],
            prompt=prompt,
            size=size,
        )
        return _openai_response_to_contentfile(response)

With Google, Imagen cannot take image inputs. For that there's gemini-2.5-flash-image, which runs via generate_content – prompt and image come in as a Part list, and the finished image sits in the response's inline_data.

    # Imagen cannot take image inputs -> Flash-Image (generate_content)
    if reference_images:
        ref_model = model if "flash-image" in str(model) else GEMINI_REF_MODEL
        client = genai.Client(api_key=GOOGLE_API_KEY)
        contents = [prompt]
        for data in reference_images:
            contents.append(types.Part.from_bytes(data=data, mime_type="image/png"))
        response = client.models.generate_content(model=ref_model, contents=contents)
        for cand in response.candidates or []:
            # A candidate may come back without content; skip it instead of crashing
            parts = cand.content.parts if cand.content else None
            for part in parts or []:
                inline = getattr(part, "inline_data", None)
                if inline and inline.data:
                    return ContentFile(inline.data)
        raise ValueError("Gemini (Flash-Image) returned no image.")

From the MCP tool to the bytes
The MCP tool itself should stay convenient: it takes either Wagtail image IDs (reference_image_ids) or public URLs (reference_image_urls) and resolves both to raw bytes before passing them to the service. If references are present, the pipeline automatically switches to the image-capable model – the pure text-to-image path stays untouched.

    reference_images: list[bytes] = []
    # 1) Wagtail images by ID
    for iid in reference_image_ids or []:
        img = Image.objects.get(id=iid)
        with img.file.open("rb") as f:
            reference_images.append(f.read())
    # 2) Arbitrary public URLs
    for url in reference_image_urls or []:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # don't pass an HTTP error page on as an "image"
        reference_images.append(resp.content)
    image = generate_and_save_image(
        prompt=prompt,
        title=title,
        use_case=use_case,
        reference_images=reference_images or None,  # None -> pure text-to-image path
    )

By the way: this MCP server – image generation included – doesn't run in the cloud for me but on a frugal mini-PC in the homelab that also carries Home Assistant and various Docker stacks.
Ad · Affiliate link – if you buy through it, I may earn a commission. It doesn’t change the price for you.
Pitfall: provider ≠ model
One thing cost me real generations: I wanted to force the provider to openai for a logo image but didn't pass the model. My config resolver then pulls the model from the settings – and there sat the Imagen model. The result: an Imagen model name ends up in the OpenAI call, which acknowledges it like this:

    Error code: 400 - The model 'imagen-4.0-generate-001' does not exist.

Lesson: provider and model belong together. Either set both explicitly (provider="openai", model="gpt-image-1") or use no override at all and rely on the configured default. The reference-image branch does catch a mismatched model and picks the respective image model – but only if the provider matches the model in the first place.
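One way to catch this before an API call is spent is a local guard that checks the pair up front. The following is a minimal sketch, not the tool's actual resolver: `MODEL_PREFIXES`, `REFERENCE_MODELS`, and `resolve_model` are hypothetical names, and the prefix table only covers the model families named in this article. It rejects a model that does not belong to the provider and otherwise mirrors the switch to the image-capable model when references are present.

```python
from typing import Optional

# Hypothetical lookup tables built from the model names used in this article.
MODEL_PREFIXES = {
    "openai": ("gpt-image", "dall-e"),
    "google": ("imagen", "gemini"),
}
REFERENCE_MODELS = {
    "openai": "gpt-image-1",
    "google": "gemini-2.5-flash-image",
}


def _accepts_references(provider: str, model: str) -> bool:
    """Same checks as the two provider branches: name prefix or 'flash-image'."""
    if provider == "openai":
        return model.startswith("gpt-image")
    return "flash-image" in model


def resolve_model(provider: str, model: Optional[str], has_references: bool) -> Optional[str]:
    """Validate the provider/model pair and pick a reference-capable model if needed."""
    prefixes = MODEL_PREFIXES.get(provider)
    if prefixes is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    if model is not None and not model.startswith(prefixes):
        # Fail locally instead of sending a foreign model name to the API.
        raise ValueError(f"Model {model!r} does not belong to provider {provider!r}")
    if not has_references:
        return model  # None means: the caller falls back to its configured default
    if model is not None and _accepts_references(provider, model):
        return model
    return REFERENCE_MODELS[provider]
```

With this guard, the combination from the error above, `resolve_model("openai", "imagen-4.0-generate-001", True)`, raises a `ValueError` locally, while `resolve_model("google", "imagen-4.0-generate-001", True)` quietly switches to `gemini-2.5-flash-image`.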
What I left out
- A real live test in CI: the image APIs cost tokens and need keys, so the logic is written against the SDK docs and only statically checked; the first real test was a manual run against an actual logo.
- Auth & error handling: retries, rate limits, invalid URLs – needed for production, deliberately kept brief here.
- Text in the image: both models love to drop unwanted letters into the image; a clear “no text” in the prompt helps but is no guarantee.
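For the error-handling item in the list above, a starting point could look like the sketch below. It is an assumption about what such hardening might contain, not code from the tool: `is_fetchable_url` and `with_retries` are hypothetical helpers, and catching every `Exception` is a simplification that production code would narrow to transient errors such as timeouts or rate limits.

```python
import time
from urllib.parse import urlparse


def is_fetchable_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the pause after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A download could then be wrapped as `with_retries(lambda: requests.get(url, timeout=30))` after `is_fetchable_url(url)` has passed; the `sleep` parameter exists only so the pause can be replaced in tests.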
Conclusion & outlook
With a few lines, a text-to-image tool becomes one that cleanly adopts real logos into the scene – the key is the right endpoint per provider (images.edit or generate_content) and a tool that resolves IDs/URLs to bytes. The hero of this article, by the way, was made exactly this way. The next logical step: the same reference mechanism for consistent characters across multiple heroes.