Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
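The per-pixel Plücker-ray embedding mentioned above has a standard closed form: each pixel is mapped to its viewing ray, represented by a unit direction `d` and moment `m = o × d`, where `o` is the camera center. A minimal sketch under assumed pinhole conventions (world-to-camera extrinsics `x_cam = R x_world + t`; the function name and half-pixel offsets are illustrative, not from the paper):

```python
import numpy as np

def plucker_ray_embedding(K, R, t, H, W):
    """Per-pixel Plücker-ray embedding (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; R, t: world-to-camera extrinsics (x_cam = R x_world + t).
    Returns an (H, W, 6) array of ray directions and moments in world coordinates.
    """
    # Camera center in world coordinates
    o = -R.T @ t                                          # (3,)
    # Pixel grid in homogeneous image coordinates (centers at half-pixel offsets)
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    # Back-project to world-space ray directions: d = R^T K^{-1} p, then normalize
    d = pix @ np.linalg.inv(K).T @ R                      # rows are (R^T K^{-1} p)^T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    # Plucker moment m = o x d is invariant to the point chosen along the ray
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=-1)                # (H, W, 6)
```

The six channels `(d, m)` can then be concatenated with the video latents as a geometric conditioning signal; because the moment encodes the camera center, the embedding changes under head motion even when ray directions are similar, which is what lets the model separate camera motion from hand motion.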