ActCam, a method developed by researchers at Oxford and INRIA, controls character motion and camera movement in AI-generated video without requiring any model fine-tuning. The system works entirely at inference time. Given a reference image, a source video showing desired character movement, and a target camera path, it generates the geometry and pose signals needed by any pretrained image-to-video diffusion model that already accepts depth and pose conditioning.
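In interface terms, the method behaves like a preprocessing function: it consumes the reference image, the acting video, and the camera path, and emits per-frame depth and pose conditioning for an unchanged backbone. Here is a rough sketch of that contract; every name, shape, and signature is assumed for illustration and none of it is the released code.

```python
# Hypothetical sketch of the inference-time contract: inputs in, conditioning
# signals out, no weight updates. All names and shapes are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class ConditioningSignals:
    depth_maps: np.ndarray  # (frames, H, W) depth rendered under the target camera
    pose_maps: np.ndarray   # (frames, H, W, 3) pose renders under the same camera

def prepare_conditioning(reference_image: np.ndarray,
                         acting_video: np.ndarray,
                         target_camera_path: list[np.ndarray]) -> ConditioningSignals:
    """Produce the depth and pose conditioning a depth+pose-aware
    image-to-video diffusion model can consume; no fine-tuning involved."""
    frames, h, w = len(target_camera_path), *reference_image.shape[:2]
    # Placeholder outputs; the real system derives these from the five-stage
    # pipeline described below.
    return ConditioningSignals(depth_maps=np.zeros((frames, h, w)),
                               pose_maps=np.zeros((frames, h, w, 3)))

ref = np.zeros((256, 256, 3))                 # dummy reference image
acting = np.zeros((8, 256, 256, 3))           # dummy acting video
path = [np.eye(4) for _ in range(8)]          # dummy camera poses, one per frame
signals = prepare_conditioning(ref, acting, path)
print(signals.depth_maps.shape, signals.pose_maps.shape)  # (8, 256, 256) (8, 256, 256, 3)
```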
The technique chains five stages. First, the reference character is removed from the scene, and a depth estimator constructs a background mesh. A 3D motion estimator then recovers articulated motion from the acting video. The recovered motion is aligned to the background through a depth-based transformation, and both pose and depth signals are rasterized under the target camera. Finally, a two-phase denoising schedule applies depth-plus-pose conditioning in early steps to lock structure and viewpoint, then drops depth and switches to pose-only guidance in the later refinement steps.
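The two-phase schedule is the piece most relevant to integration, since it only changes which conditioning channels are fed to the backbone at each denoising step. A minimal sketch of what it might look like, assuming a 50-step sampler and a hypothetical phase_split parameter (neither value is taken from the paper):

```python
# Illustrative sketch of a two-phase conditioning schedule. The phase_split
# default and the 50-step sampler length are assumptions, not the paper's values.

def conditioning_for_step(step: int, total_steps: int, phase_split: float = 0.5) -> dict:
    """Decide which control signals the backbone receives at one denoising step.
    Early steps use depth + pose to lock scene structure and camera viewpoint;
    later refinement steps drop depth and keep pose-only guidance."""
    if step < int(phase_split * total_steps):
        return {"depth": True, "pose": True}   # structure/viewpoint phase
    return {"depth": False, "pose": True}      # pose-only refinement phase

total_steps = 50  # assumed sampler length
schedule = [conditioning_for_step(s, total_steps) for s in range(total_steps)]
print(schedule[0], schedule[-1])  # {'depth': True, 'pose': True} {'depth': False, 'pose': True}
```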
On static-camera benchmarks, ActCam scores 86.47 across VBench metrics, above SteadyDancer (85.15), VACE (85.33), and HumanVid (84.68). On joint camera-and-character tests, it posts 0.8497 versus 0.8370 for Uni3C and 0.8351 for RealisDance DiT. It also achieves the lowest MPJPE (0.2087) of the three methods, indicating the most faithful reproduction of the source motion. In human preference testing with 17 evaluators, ActCam was preferred over Uni3C on motion quality (66.9% vs. 24.1%), camera adherence (53.1% vs. 27.8%), and visual quality (53.2% vs. 36.7%).
The no-fine-tuning design cuts deployment friction. Image-to-video models already running in production can gain full cinematographic control by layering ActCam's preprocessing stack on top of them. The approach maps onto standard production presets: arc left/right, vertigo, handheld, zoom-in, swing-zoom, and 45-degree arcs. Automated video production workflows can parameterize camera work programmatically rather than through prompting or manual input, as sketched below.
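To make the programmatic angle concrete, here is a minimal sketch of how an "arc left" preset could be expressed as a sequence of camera poses. The function name, parameter defaults, and 4x4 camera-to-world pose format are assumptions chosen for illustration; they are not ActCam's published API.

```python
# Illustrative sketch of parameterizing a camera preset programmatically.
# Function name, defaults, and pose convention are assumptions, not ActCam's API.
import numpy as np

def arc_left_trajectory(n_frames: int = 49, radius: float = 2.0,
                        arc_degrees: float = 45.0) -> list[np.ndarray]:
    """Build a left-arcing camera path: the camera orbits the subject at a
    fixed radius while staying aimed at the origin."""
    poses = []
    for t in np.linspace(0.0, np.radians(arc_degrees), n_frames):
        # Camera position on a horizontal circle around the subject.
        cam_pos = np.array([radius * np.sin(t), 0.0, radius * np.cos(t)])
        # Look-at frame: forward axis points from the camera toward the origin.
        forward = -cam_pos / np.linalg.norm(cam_pos)
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        pose = np.eye(4)
        pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, forward, cam_pos
        poses.append(pose)
    return poses

trajectory = arc_left_trajectory()           # one 4x4 pose per generated frame
print(len(trajectory), trajectory[0].shape)  # 49 (4, 4)
```

Swapping the preset then means swapping the trajectory generator, which is the kind of parameterization a pipeline can drive without prompting.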
Limits exist. The human preference study tested only 17 evaluators, a thin sample for perceptual tasks. The paper describes a research prototype with no announced production SDK or API. Performance under heavy occlusion, multi-character scenes, or long sequences is not benchmarked. The geometric consistency metric (Sampson Error) shows ActCam at 0.4546, slightly behind RealisDance DiT's 0.4528, indicating trade-offs in epipolar accuracy during large camera motion.
Code and the project page are live now. For AI platform teams evaluating video generation infrastructure, the test is whether ActCam's conditioning pipeline integrates with their existing diffusion backbone. No fine-tuning means an engineering sprint, not a training run.