Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging because of rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multi-camera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show that EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and performs robustly in real-world smart-glasses experiments.