We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over more than 1000 frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground truth, they lack an effective training paradigm for real-world videos because pose estimates are noisy and viewpoint revisits are scarce. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor its generations to the distant past at bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from corruption by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy that uses a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
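To make the tri-state idea concrete, the following is a minimal sketch of how a continuous motion magnitude might be discretized into three labels. It is an illustration under stated assumptions, not the labeling rule used by Infinite-World: the state names, the two thresholds `low` and `high`, and the use of a single scalar magnitude per action axis are all hypothetical.

```python
from enum import Enum


class ActionState(Enum):
    """Hypothetical tri-state labels (names are assumptions)."""
    STILL = 0      # motion confidently absent
    MOVING = 1     # motion confidently present
    UNCERTAIN = 2  # estimate too ambiguous to trust


def label_motion(magnitude: float, low: float = 0.1, high: float = 0.5) -> ActionState:
    """Map an estimated motion magnitude to one of three states.

    `low` and `high` are illustrative thresholds: values at or below
    `low` count as still, values at or above `high` count as moving,
    and everything in between is marked uncertain instead of being
    forced into either deterministic class.
    """
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")
    if low >= high:
        raise ValueError("low must be smaller than high")
    if magnitude <= low:
        return ActionState.STILL
    if magnitude >= high:
        return ActionState.MOVING
    return ActionState.UNCERTAIN


def label_trajectory(magnitudes, low: float = 0.1, high: float = 0.5):
    """Label every step of a per-frame magnitude sequence."""
    return [label_motion(m, low, high) for m in magnitudes]


if __name__ == "__main__":
    # Made-up magnitudes, used only to exercise the function.
    for state in label_trajectory([0.02, 0.3, 0.9]):
        print(state.name)
```

Under this kind of scheme, frames labeled as uncertain can still contribute to training as video data, while only the two confident states supply deterministic action supervision.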