Ego-centric World Model

We present Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pre-trained video diffusion model into an action-conditioned world model, enabling controllable prediction of the future. Rather than training from scratch, we re-purpose the rich world priors of Internet-scale video models, injecting motor commands through lightweight conditioning layers. This allows our model to follow actions faithfully, while preserving generalization and realism. Our approach scales naturally across embodiments and action spaces — from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle–driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation, requiring only modest fine-tuning. To evaluate physical correctness independent of appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. Our method improves SCS by up to 80% over prior state-of-the-art (Navigation World Models) while exhibiting up to 6x lower latency and generalizing robustly to unseen environments — including navigation inside paintings.

Zero-shot Generalisation to Paintings!

3-DoF Position Control Using

*The Wan model is still under development. These are early results

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

Initial Frame

Trajectory

Prediction

Trajectory

Prediction

25-DoF Humanoid Joint Angle Control Using

*The Wan model is still under development. These are early results

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Input Action Sequence

Initial Frame

Prediction

Initial Frame

Prediction

Zero-shot Generalisation to Real-World Images Captured by Us

25-DoF Joint Angle control

Input Action Sequence

Initial frame

Prediction

Initial frame

Prediction

Input Action Sequence

Initial frame

Prediction

Initial frame

Prediction

Input Action Sequence

Initial frame

Prediction

Initial frame

Prediction

Input Action Sequence

Initial frame

Prediction

Initial frame

Prediction

25-DoF Humanoid Joint Angle Control Results on 1x Validation Set

Manipulation

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

Initial Frame + Action Traj.

GT Video

Ours (Cosmos)

3-DoF Position Control Comparison on RECON Test Set

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

NWM

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

NWM

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

NWM

Initial Frame + Action Traj.

GT Video

Ours (SVD)

Ours (Cosmos)

NWM

Walk Through Paintings : Ego-centric World Models from Internet Priors

Anurag Bagchi¹, Zhipeng Bao¹, Homanga Bharadhwaj¹, Yu-Xiong Wang², Pavel Tokmakov³, Martial Hebert¹

¹Carnegie Mellon University, ²UIUC, ³TRI

Zero-shot Generalisation to Paintings!

3-DoF Position Control Using

25-DoF Humanoid Joint Angle Control Using

Zero-shot Generalisation to Real-World Images Captured by Us

25-DoF Joint Angle control

25-DoF Humanoid Joint Angle Control Results on 1x Validation Set

Manipulation

3-DoF Position Control Comparison on RECON Test Set