3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF, long-term alignment framework based on the Thin Plate Spline (TPS), leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Code is publicly available at https://github.com/Xian-Bei/TALO.
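To make the TPS-based alignment idea concrete, the following is a minimal sketch, not the authors' implementation: given a set of control points and their corrected (propagated) positions, it fits a smooth 3D warp (affine term plus biharmonic radial kernel) and applies it to a submap, yielding a spatially varying correction rather than a single rigid transform. All function and variable names (`fit_tps_3d`, `apply_tps_3d`, `ctrl_src`, `ctrl_dst`, the ridge term `reg`) are illustrative assumptions.

```python
# Minimal 3D thin-plate-spline (TPS) warp sketch; an assumption for
# illustration, not the TALO codebase. Uses phi(r) = r, the biharmonic
# kernel appropriate in 3D.
import numpy as np

def fit_tps_3d(ctrl_src, ctrl_dst, reg=1e-6):
    """Fit a TPS mapping ctrl_src -> ctrl_dst, both (n, 3) arrays."""
    n = ctrl_src.shape[0]
    # Pairwise kernel matrix K_ij = ||c_i - c_j||, with a small ridge
    # on the diagonal for numerical stability.
    K = np.linalg.norm(ctrl_src[:, None] - ctrl_src[None, :], axis=-1)
    K += reg * np.eye(n)
    P = np.hstack([np.ones((n, 1)), ctrl_src])  # affine basis [1, x, y, z]
    # Standard TPS linear system: [[K, P], [P^T, 0]] [W; A] = [dst; 0].
    A = np.zeros((n + 4, n + 4))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.zeros((n + 4, 3))
    b[:n] = ctrl_dst
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n:]  # kernel weights W (n, 3), affine part (4, 3)

def apply_tps_3d(points, ctrl_src, W, affine):
    """Warp (m, 3) submap points with a fitted TPS."""
    U = np.linalg.norm(points[:, None] - ctrl_src[None, :], axis=-1)
    P = np.hstack([np.ones((len(points), 1)), points])
    return U @ W + P @ affine
```

With n control points, the warp has 3(n + 4) parameters, which is what gives the alignment its higher degrees of freedom compared with a single rigid or similarity transform.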