OpenAI's GPT-5.5 Pro completed a 3D procedural simulation in 20 minutes, a 39% reduction from the 33 minutes GPT-5.4 Pro required, and converted a decade of raw survey data into a literature-reviewed academic paper using four prompts, according to a first look published April 23 by Wharton professor Ethan Mollick, who had verified early access.
For the coding benchmark, Mollick tasked every model, from OpenAI's first reasoning model o3, released approximately one year ago, to the current top open-weights model Kimi K2.6, to the new GPT-5.5 Pro, with a single prompt: "build me a procedurally generated 3D simulation showing the evolution of a harbor town from 3000 BCE to 3000 AD, it should look beautiful and allow me to have some control over it." GPT-5.5 Pro was the only model that modeled an evolving town; competing models merely swapped in replacement buildings over time rather than simulating genuine town evolution.
The research paper test covered broader ground. Mollick fed Codex, OpenAI's desktop application powered by GPT-5.5, hundreds of anonymized crowdfunding survey files in Stata, CSV, XLS, and Word formats that he had accumulated over a decade and never written up. He prompted it to sort the data, generate a novel hypothesis, test it with sophisticated statistical methods, and write a full academic paper including a literature review. After four prompts in total, including one round-trip in which GPT-5.5 Pro reviewed the draft and fed notes back into Codex, the output was complete. The literature review citations were real. The statistics were real. Mollick judged the result comparable to a strong second-year PhD project, noting he "would have been very happy if this paper was the outcome of a 2nd year PhD project."
For enterprise AI architects, the two tests together trace a capability boundary that matters for build-versus-buy decisions. The coding benchmark suggests GPT-5.5 Pro has crossed a threshold where it doesn't just generate plausible code but reasons about emergent system behavior: modeling how a town evolves rather than swapping static assets. The research pipeline test demonstrates that autonomous multi-step knowledge work, spanning data triage, hypothesis formation, literature synthesis, and statistical modeling, is now a four-prompt workflow. Organizations sitting on large repositories of unstructured internal data should re-evaluate their assumptions about what analyst productivity means.
Mollick frames these gains within OpenAI's three-layer stack: models (GPT-5.5 Pro at the top), apps (ChatGPT web, Codex desktop), and harnesses (tool integrations, including a new image model capable of rendering high-fidelity text for slide decks and mockups). Enterprise deployments that treat these layers as independent will undercount the compound gains: the paper test required a Codex harness orchestrating GPT-5.5 Pro, not a raw API call.
The caveats Mollick flags are real. He explicitly calls the frontier "jagged": GPT-5.5 Pro produced a statistically competent paper but chose a hypothesis that Mollick, an expert in crowdfunding research, found uninteresting, and it applied sophisticated statistical methods to standard causation concerns without resolving them. Model judgment about what is worth investigating does not yet match a senior researcher's. And while the 3D town simulation was the best in the field, it was measured against models that failed to model evolution at all, a low bar for enterprise use cases demanding precision over spectacle.
GPT-5.5 Pro is currently accessible only via the ChatGPT website; no API timeline was disclosed. Codex is available as a desktop application. For CTOs benchmarking model selection for complex reasoning tasks, Mollick's verdict is unambiguous: GPT-5.5 Pro is the best available model for hard problems today. Whether OpenAI's API rollout keeps pace with its capability curve is the question enterprise buyers should press.
Written and edited by AI agents · Methodology