Walk Through Paintings: Ego-centric World Models from Internet Priors

Anurag Bagchi1, Zhipeng Bao1, Homanga Bharadhwaj1, Yu-Xiong Wang2, Pavel Tokmakov3, Martial Hebert1

1Carnegie Mellon University, 2UIUC, 3TRI

arXiv | Code (Soon)
TL;DR: Initial Frame (I_0) + Action Trajectory (A_{1:T}) → Future Frames (I_{1:T})
We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that turns any pre-trained video diffusion model into an action-conditioned world model, enabling controllable prediction of future frames. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This lets the model follow actions faithfully while preserving generalization and realism. The approach scales naturally across embodiments and action spaces, from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric dynamics driven by joint angles is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation with only modest fine-tuning. To evaluate physical correctness independently of appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. Our method improves SCS by up to 80% over the prior state of the art, Navigation World Models (NWM), while exhibiting up to 6x lower latency and generalizing robustly to unseen environments, including navigation inside paintings.
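To make the idea of "lightweight conditioning layers" concrete, here is a minimal sketch in plain Python. It is not the EgoWM implementation: this page does not specify how actions are injected, so the `ActionConditioner` class, the two-layer MLP, the additive injection into a per-frame conditioning embedding, and the zero-initialized output layer are all illustrative assumptions. The zero initialization is a common trick for fine-tuning pre-trained models, chosen here so that the conditioning starts as a no-op and the pre-trained behavior is untouched before training.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_linear(in_dim: int, out_dim: int, rng: random.Random, zero: bool = False) -> Matrix:
    """Return a weight matrix with out_dim rows of in_dim floats (no bias)."""
    if zero:
        return [[0.0] * in_dim for _ in range(out_dim)]
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def silu(x: Vector) -> Vector:
    """SiLU activation, x * sigmoid(x), applied element-wise."""
    return [v / (1.0 + math.exp(-v)) for v in x]


class ActionConditioner:
    """Hypothetical conditioning layer: one action per frame -> embedding offset.

    Only action_dim depends on the embodiment (e.g. 3 for the mobile robot,
    25 for the humanoid); the rest of the layer is shared in shape.
    """

    def __init__(self, action_dim: int, embed_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.action_dim = action_dim
        self.w_in = make_linear(action_dim, embed_dim, rng)
        # Assumption: zero-initialized output so the layer is a no-op before fine-tuning.
        self.w_out = make_linear(embed_dim, embed_dim, rng, zero=True)

    def embed(self, action: Vector) -> Vector:
        if len(action) != self.action_dim:
            raise ValueError(f"expected {self.action_dim} action values, got {len(action)}")
        return linear(self.w_out, silu(linear(self.w_in, action)))

    def condition(self, frame_embeddings: Matrix, actions: Matrix) -> Matrix:
        """Add the embedded action A_t to the conditioning embedding of frame t."""
        if len(frame_embeddings) != len(actions):
            raise ValueError("need exactly one action per predicted frame")
        return [
            [e + a for e, a in zip(emb, self.embed(act))]
            for emb, act in zip(frame_embeddings, actions)
        ]


# Usage: T future frames, each with an embed_dim-sized conditioning vector.
T, embed_dim = 4, 8
frames = [[0.1] * embed_dim for _ in range(T)]

nav = ActionConditioner(action_dim=3, embed_dim=embed_dim)        # 3-DoF navigation
humanoid = ActionConditioner(action_dim=25, embed_dim=embed_dim)  # 25-DoF joint angles

out = nav.condition(frames, [[0.5, 0.0, 0.1]] * T)
joint_offset = humanoid.embed([0.0] * 25)
```

In this sketch, switching between the 3-DoF and 25-DoF settings changes only the input width of the first layer, which is one way an architecture-agnostic scheme could accommodate different action spaces without modifying the video backbone.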

Zero-shot Generalisation to Paintings!

3-DoF Position Control Using

*The Wan model is still under development. These are early results.
[Video grid, 11 examples. Each example shows an Initial Frame followed by two Trajectory and Prediction pairs.]

25-DoF Humanoid Joint Angle Control Using

*The Wan model is still under development. These are early results.
[Video grid, 6 examples. Each example shows an Input Action Sequence applied to two Initial Frames, each with its Prediction.]

Zero-shot Generalisation to Real-World Images Captured by Us

25-DoF Joint Angle Control

[Video grid, 4 examples. Each example shows an Input Action Sequence applied to two Initial Frames, each with its Prediction.]

25-DoF Humanoid Joint Angle Control Results on 1X Validation Set

Manipulation

[Video grid, 12 examples. Columns: Initial Frame + Action Traj., GT Video, Ours (SVD) (first two examples only), Ours (Cosmos).]

3-DoF Position Control Comparison on RECON Test Set

[Video grid, 4 examples. Columns: Initial Frame + Action Traj., GT Video, Ours (SVD), Ours (Cosmos), NWM.]