ActCam, a method developed by researchers at Oxford and INRIA, controls character motion and camera movement in AI-generated video without requiring any model fine-tuning. The system works entirely at inference time. Given a reference image, a source video showing desired character movement, and a target camera path, it generates the geometry and pose signals needed by any pretrained image-to-video diffusion model that already accepts depth and pose conditioning.
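In interface terms, the method behaves like a preprocessing function: it consumes the reference image, the acting video, and the camera path, and emits per-frame depth and pose conditioning for an unchanged backbone. Here is a rough sketch of that contract; every name, shape, and signature is assumed for illustration and none of it is the released code.

```python
# Hypothetical sketch of the inference-time contract: inputs in, conditioning
# signals out, no weight updates. All names and shapes are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class ConditioningSignals:
    depth_maps: np.ndarray  # (frames, H, W) depth rendered under the target camera
    pose_maps: np.ndarray   # (frames, H, W, 3) pose renders under the same camera

def prepare_conditioning(reference_image: np.ndarray,
                         acting_video: np.ndarray,
                         target_camera_path: list[np.ndarray]) -> ConditioningSignals:
    """Produce the depth and pose conditioning a depth+pose-aware
    image-to-video diffusion model can consume; no fine-tuning involved."""
    frames, h, w = len(target_camera_path), *reference_image.shape[:2]
    # Placeholder outputs; the real system derives these from the five-stage
    # pipeline described below.
    return ConditioningSignals(depth_maps=np.zeros((frames, h, w)),
                               pose_maps=np.zeros((frames, h, w, 3)))

ref = np.zeros((256, 256, 3))                 # dummy reference image
acting = np.zeros((8, 256, 256, 3))           # dummy acting video
path = [np.eye(4) for _ in range(8)]          # dummy camera poses, one per frame
signals = prepare_conditioning(ref, acting, path)
print(signals.depth_maps.shape, signals.pose_maps.shape)  # (8, 256, 256) (8, 256, 256, 3)
```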
The technique chains five stages. First, the reference character is removed from the scene, and a depth estimator constructs a background mesh. A 3D motion estimator then recovers articulated motion from the acting video. The recovered motion is aligned to the background through a depth-based transformation, and both pose and depth signals are rasterized under the target camera. Finally, a two-phase denoising schedule applies depth-plus-pose conditioning in early steps to lock structure and viewpoint, then drops depth and switches to pose-only guidance in the later refinement steps.
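The two-phase schedule is the piece most relevant to integration, since it only changes which conditioning channels are fed to the backbone at each denoising step. A minimal sketch of what it might look like, assuming a 50-step sampler and a hypothetical phase_split parameter (neither value is taken from the paper):

```python
# Illustrative sketch of a two-phase conditioning schedule. The phase_split
# default and the 50-step sampler length are assumptions, not the paper's values.

def conditioning_for_step(step: int, total_steps: int, phase_split: float = 0.5) -> dict:
    """Decide which control signals the backbone receives at one denoising step.
    Early steps use depth + pose to lock scene structure and camera viewpoint;
    later refinement steps drop depth and keep pose-only guidance."""
    if step < int(phase_split * total_steps):
        return {"depth": True, "pose": True}   # structure/viewpoint phase
    return {"depth": False, "pose": True}      # pose-only refinement phase

total_steps = 50  # assumed sampler length
schedule = [conditioning_for_step(s, total_steps) for s in range(total_steps)]
print(schedule[0], schedule[-1])  # {'depth': True, 'pose': True} {'depth': False, 'pose': True}
```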
On static-camera benchmarks, ActCam scores 86.47 across VBench metrics, above SteadyDancer (85.15), VACE (85.33), and HumanVid (84.68). On joint camera-and-character tests, it posts 0.8497 versus 0.8370 for Uni3C and 0.8351 for RealisDance DiT. It also achieves the lowest MPJPE (0.2087) of the three methods, indicating the most faithful reproduction of the source motion. In human preference testing with 17 evaluators, ActCam was preferred over Uni3C on motion quality (66.9% vs. 24.1%), camera adherence (53.1% vs. 27.8%), and visual quality (53.2% vs. 36.7%).
The no-fine-tuning design cuts deployment friction. Image-to-video models already running in production can gain full cinematographic control by layering ActCam's preprocessing stack on top of them. The approach maps onto standard production presets: arc left/right, vertigo, handheld, zoom-in, swing-zoom, and 45-degree arcs. Automated video production workflows can parameterize camera work programmatically rather than through prompting or manual input, as sketched below.
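To make the programmatic angle concrete, here is a minimal sketch of how an "arc left" preset could be expressed as a sequence of camera poses. The function name, parameter defaults, and 4x4 camera-to-world pose format are assumptions chosen for illustration; they are not ActCam's published API.

```python
# Illustrative sketch of parameterizing a camera preset programmatically.
# Function name, defaults, and pose convention are assumptions, not ActCam's API.
import numpy as np

def arc_left_trajectory(n_frames: int = 49, radius: float = 2.0,
                        arc_degrees: float = 45.0) -> list[np.ndarray]:
    """Build a left-arcing camera path: the camera orbits the subject at a
    fixed radius while staying aimed at the origin."""
    poses = []
    for t in np.linspace(0.0, np.radians(arc_degrees), n_frames):
        # Camera position on a horizontal circle around the subject.
        cam_pos = np.array([radius * np.sin(t), 0.0, radius * np.cos(t)])
        # Look-at frame: forward axis points from the camera toward the origin.
        forward = -cam_pos / np.linalg.norm(cam_pos)
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        pose = np.eye(4)
        pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, forward, cam_pos
        poses.append(pose)
    return poses

trajectory = arc_left_trajectory()           # one 4x4 pose per generated frame
print(len(trajectory), trajectory[0].shape)  # 49 (4, 4)
```

Swapping the preset then means swapping the trajectory generator, which is the kind of parameterization a pipeline can drive without prompting.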
Limits exist. The human preference study tested only 17 evaluators, a thin sample for perceptual tasks. The paper describes a research prototype with no announced production SDK or API. Performance under heavy occlusion, multi-character scenes, or long sequences is not benchmarked. The geometric consistency metric (Sampson Error) shows ActCam at 0.4546, slightly behind RealisDance DiT's 0.4528, indicating trade-offs in epipolar accuracy during large camera motion.
Code and the project page are live now. For AI platform teams evaluating video generation infrastructure, the test is whether ActCam's conditioning pipeline integrates with their existing diffusion backbone. No fine-tuning means an engineering sprint, not a training run.