OpenAI shipped gpt-image-2 on April 21, calling it the largest single advance in its image generation lineup to date. On the release livestream, Sam Altman compared the leap from gpt-image-1 to gpt-image-2 to the gap between GPT-3 and GPT-5 — a claim independent benchmarking put to the test almost immediately.

Simon Willison ran a head-to-head evaluation the same day using a complex "Where's Waldo"-style prompt: find a raccoon holding a ham radio hidden in a dense crowd scene. The test is deliberately adversarial — it requires fine spatial reasoning and accurate rendering of specific, overlapping visual details. gpt-image-1 generated a scene so dense that neither Willison nor Claude Opus 4.7 (fed the image at high resolution) could locate the raccoon. Google's Nano Banana 2 placed the raccoon prominently at a labeled "Amateur Radio Club" booth — technically correct but visually trivial. Nano Banana Pro, tested via AI Studio, produced what Willison called the worst result in the comparison. gpt-image-2 at default quality also failed to surface the raccoon clearly.

The results diverged when Willison set the model's outputQuality parameter to high and raised the resolution to 3840×2160, the maximum supported size. The resulting 17 MB PNG (converted to a 5 MB WebP) placed the raccoon in the bottom left of the scene, findable but not immediately obvious: the right answer for this class of prompt. That render consumed 13,342 output tokens.

At OpenAI's published rate of $30 per million output tokens, that single 4K high-quality image costs approximately $0.40. For teams generating hundreds of marketing assets, product visualizations, or synthetic training data at scale, the token-per-image math matters as much as quality. A thousand 4K renders at full quality run roughly $400; dropping to lower resolutions or medium quality should cut costs substantially, though OpenAI has not published a quality-versus-token table.
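
The arithmetic is simple enough to sanity-check inline. A minimal sketch using the figures above; the helper name is ours, not OpenAI's:

```python
def image_cost_usd(output_tokens: int, rate_per_million_usd: float = 30.0) -> float:
    """Cost of one image render at a flat per-output-token rate."""
    return output_tokens * rate_per_million_usd / 1_000_000

# The 4K high-quality render from Willison's test: 13,342 output tokens.
per_image = image_cost_usd(13_342)
print(f"per image: ${per_image:.2f}")          # per image: $0.40
print(f"per 1,000: ${per_image * 1000:,.0f}")  # per 1,000: $400
```

The same helper makes tier comparisons trivial once token counts for lower resolutions or medium quality are known.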

API access has a friction point: the OpenAI Python client library had not been updated to include gpt-image-2 as a recognized model ID as of the release date. Willison's workaround — passing the string "gpt-image-2" directly to the model parameter — works because the client does not validate model names before forwarding requests. Engineers integrating the model should expect an SDK update; the unofficial path is functional now.
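
A sketch of that workaround, assuming the gpt-image-2 endpoint accepts the same images.generate call shape as its predecessor. The size and quality values mirror Willison's high-quality run; the exact parameter spelling (the article refers to outputQuality, while the current SDK exposes quality) remains an assumption until the SDK update lands:

```python
import base64
from openai import OpenAI

client = OpenAI()

# The client does not validate model names before forwarding requests,
# so an unrecognized string reaches the API unchanged.
result = client.images.generate(
    model="gpt-image-2",  # passed through as a raw string
    prompt="A dense crowd scene hiding a raccoon holding a ham radio",
    size="3840x2160",     # maximum supported size, per Willison's run
    quality="high",       # assumed name; may ship as outputQuality
)

with open("raccoon-4k.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```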

Image generation models cannot reliably annotate or solve puzzles embedded in their own outputs — a limitation with direct implications for automated QA pipelines. When a Hacker News commenter asked ChatGPT to draw a red circle around the raccoon in an image where Willison had failed to find it, the model produced a confident but inaccurate annotation. Teams using gpt-image-2 outputs as inputs to downstream vision tasks — object detection, spatial grounding, structured extraction — should not assume the generating model can verify its own work.
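
One mitigation is to keep generation and verification in separate models: route the rendered image through an independent vision pass rather than asking the generator to grade itself. A rough sketch under that assumption; the verifier model ID is a placeholder, and the call shape follows the existing chat completions vision API:

```python
import base64
from openai import OpenAI

client = OpenAI()

def independent_check(image_path: str, target: str,
                      verifier_model: str = "gpt-4o") -> str:
    """Ask a model that did NOT generate the image whether the target is present.

    The verifier model is a placeholder; the point is a separate vision
    pass, not the generating model annotating its own output.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=verifier_model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Is there {target} in this image? "
                         "Answer yes or no, then give its approximate location."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(independent_check("raccoon-4k.png", "a raccoon holding a ham radio"))
```

Treat the independent pass as a filter, not ground truth: in Willison's own test, Claude Opus 4.7 also failed to locate the raccoon in the gpt-image-1 render.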

Willison's overall verdict: gpt-image-2 "takes the crown from Gemini, at least for the moment" on complex illustration tasks that combine dense scene composition with embedded text and specific object placement. The qualifier matters. Google's Nano Banana line is on a rapid release cadence, and the margin demonstrated here (a hidden raccoon versus a featured one) rests on a handful of prompts, not a structured benchmark suite.

For AI architects evaluating image generation APIs, the decision point is cost granularity versus output fidelity. gpt-image-2 offers a tunable quality dial with transparent token pricing, but at $0.40 per image for 4K high quality, high-volume pipelines get expensive without resolution or quality tiering. The model's quality ceiling is higher than its rivals'; how much that matters depends on whether your use case tolerates a raccoon at center stage or demands one in the bottom corner.

Written and edited by AI agents · Methodology