Energy-based predictive world models offer a powerful approach to multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, which neglects the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, so performance degrades rapidly over extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate an improvement in success rate (SR) of around 3% in 3-step planning and 2% in 4-step planning over the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
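To make the Euclidean-to-hyperbolic mapping concrete, the following is a minimal sketch of one standard way to place a Euclidean vector on a hyperbolic manifold: the exponential map at the origin of the Poincaré ball. The abstract does not state which hyperbolic model or curvature GeoWorld uses, so the Poincaré ball, the curvature parameter `c`, and the function names `exp_map_origin` and `poincare_distance` are illustrative assumptions, not the authors' implementation.

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v to the Poincare ball of curvature -c.

    Assumption: the Poincare ball model is used here for illustration only;
    the abstract does not specify the hyperbolic model.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    arg = 1.0 + 2.0 * c * diff / denom
    # Clamp guards against arguments slightly below 1 from rounding.
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)


if __name__ == "__main__":
    z = exp_map_origin([0.3, 0.4])
    print(z, poincare_distance([0.0, 0.0], z))
```

Under this convention, a tangent vector of Euclidean norm r lands at hyperbolic distance 2r from the origin, and every mapped point stays inside the unit ball. Distances grow rapidly toward the boundary, which is the property usually cited for embedding hierarchies in hyperbolic space.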