Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S$^2$VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.
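To make the bottleneck concrete, the following is a minimal NumPy sketch of sampling from a product of Power Spherical distributions. It is not the authors' implementation. It assumes the standard Power Spherical form, with density proportional to $(1 + \mu^\top x)^\kappa$ on the unit sphere, and the usual sampler: a Beta-distributed marginal along the mean direction followed by a Householder reflection. The function names, the number of blocks, and the per-block dimension are hypothetical choices for illustration. The abstract does not specify how the encoder produces the parameters, the KL term, or the decoder, so these are omitted.

```python
import numpy as np


def sample_power_spherical(mu, kappa, rng):
    """Draw one sample from PowerSpherical(mu, kappa) on S^{d-1} (requires d >= 2)."""
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)  # the mean direction must be a unit vector
    d = mu.shape[0]

    # Marginal along the mean direction: t = 2z - 1 with
    # z ~ Beta((d - 1) / 2 + kappa, (d - 1) / 2).
    z = rng.beta((d - 1) / 2.0 + kappa, (d - 1) / 2.0)
    t = 2.0 * z - 1.0

    # Uniform direction on the (d - 2)-sphere orthogonal to the first axis.
    v = rng.standard_normal(d - 1)
    v /= np.linalg.norm(v)

    # Sample concentrated around e1 = (1, 0, ..., 0).
    y = np.concatenate(([t], np.sqrt(max(1.0 - t * t, 0.0)) * v))

    # Householder reflection that maps e1 onto mu.
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = e1 - mu
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:  # mu already equals e1, so no reflection is needed
        return y
    u /= norm_u
    return y - 2.0 * np.dot(u, y) * u


def sample_product_latent(mus, kappas, rng):
    """Concatenate independent Power Spherical samples, one per latent block."""
    blocks = [sample_power_spherical(m, k, rng) for m, k in zip(mus, kappas)]
    return np.concatenate(blocks)


# Illustrative usage: 4 blocks of dimension 8 (both values are assumptions).
rng = np.random.default_rng(0)
mus = rng.standard_normal((4, 8))   # stand-in for encoder-predicted directions
kappas = np.full(4, 1e4)            # stand-in for encoder-predicted concentrations
z = sample_product_latent(mus, kappas, rng)
```

Each block of `z` lies on its own unit sphere, so the full latent lives on a product of hyperspheres rather than in unconstrained Euclidean space as with a Gaussian bottleneck. Larger `kappa` concentrates a block's samples around its mean direction.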