Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, in which learning proposes geometric hypotheses and geometric algorithms make the final estimation decisions (the "disposal" stage). In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with the camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and then passed through a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
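The geometric backend named above, point-to-plane ICP, minimizes the projection of each residual onto the target surface normal. A minimal sketch of one Gauss-Newton step is shown below, assuming known correspondences (`src`, `dst`) and target normals; the function name and interface are illustrative, not taken from the paper's implementation. The small-motion linearization replaces the rotation with an angular increment ω, giving the per-correspondence constraint (src_i × n_i)·ω + n_i·t = n_i·(dst_i − src_i), which yields a 6-DoF linear least-squares problem.

```python
import numpy as np

def point_to_plane_step(src, dst, normals):
    """One Gauss-Newton step of point-to-plane ICP (hypothetical sketch).

    src, dst : (N, 3) corresponding 3D points (e.g., from back-projected depth).
    normals  : (N, 3) unit surface normals at the target points.
    Minimizes sum_i ((src_i + omega x src_i + t - dst_i) . n_i)^2 over the
    small rotation omega and translation t, then returns a 4x4 rigid transform
    with the rotation recovered via the Rodrigues formula.
    """
    # Stacked 6-column Jacobian: [src x n | n] per correspondence.
    A = np.hstack([np.cross(src, normals), normals])
    # Signed point-to-plane residuals n . (dst - src).
    b = np.einsum('ij,ij->i', normals, dst - src)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    omega, t = x[:3], x[3:]

    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = omega / theta  # rotation axis
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

In a full pipeline this step would be iterated with re-established correspondences; in the propose-and-dispose framing, the learning model supplies the initial pose and depth hypotheses, while this geometric solve validates and refines them.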