We present Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pre-trained video diffusion model into an action-conditioned world model, enabling controllable prediction of the future. Rather than training from scratch, we re-purpose the rich world priors of Internet-scale video models, injecting motor commands through lightweight conditioning layers. This allows our model to follow actions faithfully while preserving generalization and realism. Our approach scales naturally across embodiments and action spaces, from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation, requiring only modest fine-tuning. To evaluate physical correctness independent of appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. Our method improves SCS by up to 80% over the prior state of the art (Navigation World Models) while exhibiting up to 6x lower latency, and it generalizes robustly to unseen environments, including navigation inside paintings.
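The abstract does not spell out the conditioning mechanism, so the sketch below illustrates one plausible form of such a lightweight action-conditioning layer: a FiLM-style adapter that embeds per-frame motor commands and modulates the frozen backbone's hidden features. All names and shapes here (ActionConditioner, action_dim, hidden_dim, the feature layout) are illustrative assumptions for exposition, not the released EgoWM implementation.

```python
import torch
import torch.nn as nn

class ActionConditioner(nn.Module):
    """Hypothetical FiLM-style conditioning block (sketch, not the released EgoWM code).
    It embeds a per-frame action vector (e.g. 3-DoF pose deltas or 25-DoF joint angles)
    and modulates the frozen video backbone's features with a learned scale and shift."""

    def __init__(self, action_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(action_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),  # produces scale and shift
        )
        # Zero-init the final projection so training starts from the unmodified video prior.
        nn.init.zeros_(self.embed[-1].weight)
        nn.init.zeros_(self.embed[-1].bias)

    def forward(self, features: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, tokens, hidden_dim) intermediate backbone activations
        # actions:  (batch, frames, action_dim) motor commands aligned to the frames
        scale, shift = self.embed(actions).chunk(2, dim=-1)
        return features * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)


# Usage sketch: only the conditioning layers would be trained; the backbone stays frozen.
cond = ActionConditioner(action_dim=25, hidden_dim=1024)  # e.g. 25-DoF humanoid joints
feats = torch.randn(2, 16, 256, 1024)                     # hypothetical backbone features
acts = torch.randn(2, 16, 25)
out = cond(feats, acts)                                    # same shape, now action-aware
```

Under this (assumed) design, zero-initializing the last projection makes the adapter an identity map at the start of fine-tuning, so action information is injected gradually without disturbing the pre-trained video prior.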
Zero-shot Generalisation to Paintings!
3-DoF Position Control Using Wan*
*The Wan model is still under development; these are early results.
[Video gallery: each example shows an initial frame together with two input trajectories and the corresponding predicted rollouts.]
25-DoF Humanoid Joint Angle Control Using Wan*
*The Wan model is still under development; these are early results.
[Video gallery: each example shows an input action sequence applied to two different initial frames, together with the corresponding predicted rollouts.]
Zero-shot Generalisation to Real-World Images Captured by Us
25-DoF Joint Angle Control
[Video gallery: each example shows an input action sequence applied to two real-world initial frames captured by us, together with the corresponding predicted rollouts.]
25-DoF Humanoid Joint Angle Control Results on the 1X Validation Set
Manipulation
[Video gallery: each example shows the initial frame with the action trajectory, the ground-truth video, and our predictions (SVD and Cosmos backbones).]
3-DoF Position Control Comparison on RECON Test Set