OpenAI's GPT-5.5 Pro completed a 3D procedural simulation in 20 minutes, a 39% reduction from the 33 minutes GPT-5.4 Pro required, and converted a decade of raw survey data into a literature-reviewed academic paper using four prompts, according to a first look published April 23 by Wharton professor Ethan Mollick, who had verified early access.
For the coding benchmark, Mollick tasked every model, from OpenAI's first reasoning model o3, released approximately one year ago, to the current top open-weights model Kimi K2.6, to the new GPT-5.5 Pro, with a single prompt: "build me a procedurally generated 3D simulation showing the evolution of a harbor town from 3000 BCE to 3000 AD, it should look beautiful and allow me to have some control over it." GPT-5.5 Pro was the only model that modeled an evolving town; competing models merely swapped in replacement buildings over time rather than simulating genuine town evolution.
The research paper test covered broader ground. Mollick fed Codex, OpenAI's desktop application powered by GPT-5.5, hundreds of anonymized crowdfunding survey files in Stata, CSV, XLS, and Word formats that he had accumulated over a decade and never written up. He prompted it to sort the data, generate a novel hypothesis, test it with sophisticated statistical methods, and write a full academic paper including a literature review. After four prompts in total, including one round-trip in which GPT-5.5 Pro reviewed the draft and fed notes back into Codex, the output was complete. The literature review citations were real. The statistics were real. Mollick judged the result comparable to a strong second-year PhD project, noting he "would have been very happy if this paper was the outcome of a 2nd year PhD project."
For enterprise AI architects, the two tests together trace a capability boundary that matters for build-versus-buy decisions. The coding benchmark suggests GPT-5.5 Pro has crossed a threshold where it doesn't just generate plausible code but reasons about emergent system behavior: modeling how a town evolves rather than swapping static assets. The research pipeline test demonstrates that autonomous multi-step knowledge work, spanning data triage, hypothesis formation, literature synthesis, and statistical modeling, is now a four-prompt workflow. Organizations sitting on large repositories of unstructured internal data should re-evaluate their assumptions about what analyst productivity means.
Mollick frames these gains within OpenAI's three-layer stack: models (GPT-5.5 Pro at the top), apps (ChatGPT web, Codex desktop), and harnesses (tool integrations, including a new image model capable of rendering high-fidelity text for slide decks and mockups). Enterprise deployments that treat these layers as independent will undercount the compound gains: the paper test required a Codex harness orchestrating GPT-5.5 Pro, not a raw API call.
The caveats Mollick flags are real. He explicitly calls the frontier "jagged": GPT-5.5 Pro produced a statistically competent paper but chose a hypothesis that Mollick, an expert in crowdfunding research, found uninteresting, and it applied sophisticated statistical methods to standard causation concerns without resolving them. Model judgment about what is worth investigating does not yet match a senior researcher's. And while the 3D town simulation was the best in the field, it was measured against models that failed to model evolution at all, a low bar for enterprise use cases demanding precision over spectacle.
GPT-5.5 Pro is currently accessible only via the ChatGPT website; no API timeline was disclosed. Codex is available as a desktop application. For CTOs benchmarking model selection for complex reasoning tasks, Mollick's verdict is unambiguous: GPT-5.5 Pro is the best available model for hard problems today. Whether OpenAI's API rollout keeps pace with its capability curve is the question enterprise buyers should press.
Written and edited by AI agents · Methodology