Researchers at Carnegie Mellon University and UC San Diego have released PhyCo, a framework that embeds continuous, interpretable physical constraints — friction, restitution, deformation, and applied force — directly into video diffusion models. The result: physically consistent video synthesis without a physics simulator at inference time.

Current video generation models produce high visual fidelity but fail basic physics. Objects drift through surfaces. Collisions produce no rebound. Material deformation bears no relationship to underlying properties. PhyCo addresses this gap using three components. The team constructed a dataset of over 100,000 photorealistic simulation videos in which friction, restitution, deformation, and force are systematically varied. They fine-tuned a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps, allowing the model to accept material parameters as direct control signals. They layered in VLM-guided reward optimization: a vision-language model evaluates generated clips against targeted physics queries and feeds differentiable feedback into the training loop.
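To make the conditioning pathway concrete, here is a minimal PyTorch sketch of the ControlNet pattern applied to pixel-aligned property maps. Everything here is an illustrative assumption, not PhyCo's actual architecture: the class name, the four-channel layout, and the 320-channel output are made up for the example.

```python
# Minimal sketch (hypothetical names): packing pixel-aligned physical
# property maps into a ControlNet-style conditioning signal.
import torch
import torch.nn as nn

class PhysPropertyAdapter(nn.Module):
    """Encodes per-pixel physical property maps into residual features
    a frozen diffusion UNet block can consume (ControlNet pattern:
    small trainable encoder plus zero-initialized projection)."""
    def __init__(self, n_props: int = 4, hidden: int = 64, out_ch: int = 320):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(n_props, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        # Zero-initialized 1x1 conv: the adapter starts as a no-op,
        # so fine-tuning cannot destabilize the pretrained backbone.
        self.zero_proj = nn.Conv2d(hidden, out_ch, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, prop_maps: torch.Tensor) -> torch.Tensor:
        # prop_maps: (B, 4, H, W) holding friction, restitution,
        # deformation, and applied force, each normalized to [0, 1].
        return self.zero_proj(self.encode(prop_maps))

# Usage: stack the four property maps channel-wise, pixel-aligned with
# the frame, and add the resulting residual to the UNet's features.
friction    = torch.full((1, 1, 64, 64), 0.3)  # low-friction surface
restitution = torch.full((1, 1, 64, 64), 0.9)  # bouncy material
deformation = torch.zeros(1, 1, 64, 64)        # rigid body
force       = torch.zeros(1, 1, 64, 64)        # no external force
cond = torch.cat([friction, restitution, deformation, force], dim=1)

adapter = PhysPropertyAdapter()
residual = adapter(cond)  # (1, 320, 64, 64); all zeros before training
```

The zero-initialized projection is the standard ControlNet trick: at the start of fine-tuning the adapter contributes nothing, preserving the pretrained backbone's behavior.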

On the Physics-IQ benchmark, PhyCo scores higher on physical realism than its baselines. Human studies corroborate this: raters found the generated outputs to exhibit clearer control over physical attributes, achieved with no simulator or geometry reconstruction at inference.
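The reward component is the least standard of the three, so a toy sketch helps. Below, a small differentiable scorer stands in for the vision-language model; how PhyCo actually extracts a differentiable signal from physics queries is not spelled out here, so every name and shape is an assumption.

```python
# Hedged sketch of VLM-guided reward optimization with stub modules.
import torch
import torch.nn as nn

class StubGenerator(nn.Module):
    """Stand-in for the fine-tuned video diffusion model: maps property-map
    videos (B, 4, T, H, W) to RGB clips (B, 3, T, H, W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(4, 3, kernel_size=1)

    def forward(self, prop_maps):
        return torch.sigmoid(self.net(prop_maps))

class StubVLMScorer(nn.Module):
    """Stand-in for a VLM scoring a clip against a physics query.
    A learned pooling head keeps the reward differentiable end to end."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3, 1)

    def forward(self, clip, query):
        pooled = clip.mean(dim=(2, 3, 4))                     # (B, 3)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)   # (B,) in [0, 1]

gen, scorer = StubGenerator(), StubVLMScorer()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

prop_maps = torch.rand(2, 4, 8, 32, 32)  # two clips, 8 frames each
reward = scorer(gen(prop_maps), "Does the ball rebound as restitution=0.9 implies?")
loss = -reward.mean()                    # gradient ascent on the physics reward
opt.zero_grad()
loss.backward()
opt.step()
```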

For enterprise architects, the key advantage is inference-time autonomy. Existing physically grounded generation approaches require a live physics engine or explicit 3D mesh to constrain outputs — costs that escalate at production scale. PhyCo encodes physical priors into model weights via ControlNet conditioning. Inference is a standard diffusion pass. This makes PhyCo a candidate for industrial design, product visualization, and synthetic data generation workflows that otherwise require simulator infrastructure.
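Concretely, sampling needs nothing beyond the usual denoising loop plus one extra tensor. A schematic sketch, with stubbed components and a deliberately fake update rule, shows where the property maps enter and where a simulator does not:

```python
import torch
import torch.nn as nn

# Stubs: a real pipeline would plug in the fine-tuned video UNet and the
# trained property-map adapter; these exist only to make the loop runnable.
unet = lambda x, t, cond: torch.zeros_like(x)   # noise predictor (stub)
adapter = nn.Conv2d(4, 3, kernel_size=1)        # property-map encoder (stub)

@torch.no_grad()
def sample(prop_maps, steps: int = 50, shape=(1, 3, 64, 64)):
    cond = adapter(prop_maps)      # physics conditioning, computed once up front
    x = torch.randn(shape)         # start from pure noise
    for t in reversed(range(steps)):
        eps = unet(x, t, cond)     # each step is an ordinary denoising pass
        x = x - eps / steps        # placeholder update, not a real scheduler
    return x                       # no physics engine was called anywhere

frame = sample(torch.rand(1, 4, 64, 64))
```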

FIG. 02 PhyCo combines a pretrained diffusion backbone with ControlNet conditioning and VLM-guided reward optimization to enforce physics at inference time without simulators.

The robotics use case is especially pressing. Training manipulation policies on generated video breaks down when contact dynamics are unrealistic. A video model that correctly renders the difference between a rubber gripper contacting rigid metal and one contacting foam could produce higher-fidelity training rollouts, closing a gap that has constrained synthetic data pipelines.
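As a hypothetical illustration, generating such contrasting rollouts could reduce to sweeping the conditioning channels; the material values below are invented for the example.

```python
# Hypothetical sketch: sweeping material parameters to produce contrasting
# contact clips for a manipulation-policy dataset.
import torch

materials = {
    "rigid_metal": dict(friction=0.6, restitution=0.3, deformation=0.0),
    "foam":        dict(friction=0.8, restitution=0.1, deformation=0.9),
}

def property_maps(mat: dict, h: int = 64, w: int = 64) -> torch.Tensor:
    # One constant map per property; a real pipeline would paint these
    # per-object rather than filling the whole frame.
    chans = [torch.full((1, 1, h, w), mat[k])
             for k in ("friction", "restitution", "deformation")]
    chans.append(torch.zeros(1, 1, h, w))  # applied-force map, empty here
    return torch.cat(chans, dim=1)         # (1, 4, H, W), ready for the adapter

rollouts = {name: property_maps(m) for name, m in materials.items()}
```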

The dataset is drawn entirely from simulation, and generalization beyond synthetic environments is listed as future work. The Physics-IQ benchmark covers a bounded set of physical phenomena; the field lacks a standardized, comprehensive physical realism evaluation suite. It remains unclear whether ControlNet conditioning on physical property maps degrades appearance fidelity or introduces artifacts when material parameters conflict with the visual scene.

The training recipe requires only a pretrained diffusion backbone and the synthetic dataset; no proprietary simulator license is needed. For teams already operating diffusion-based content or simulation pipelines, the marginal cost of adding physical grounding is now substantially lower.
