Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion teacher model using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated-reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.