HyVGGT-VO: Tightly Coupled Hybrid Dense Visual Odometry with Feed-Forward Models

Dense visual odometry (VO), which provides both pose estimation and dense 3D reconstruction, is a cornerstone of applications ranging from robotics to augmented reality. Recently, feed-forward models have demonstrated remarkable capabilities in dense mapping. However, when these models are used in dense visual SLAM systems, their heavy computational burden limits them to sparse pose outputs at keyframes, and even then they fail to achieve real-time pose estimation. In contrast, traditional sparse methods offer high computational efficiency and high-frequency pose outputs, but cannot perform dense reconstruction. To address these limitations, we propose HyVGGT-VO, a novel framework that combines the computational efficiency of sparse VO with the dense reconstruction capabilities of feed-forward models. To the best of our knowledge, this is the first work to tightly couple a traditional VO framework with VGGT, a state-of-the-art feed-forward model. Specifically, we design an adaptive hybrid tracking frontend that dynamically switches between traditional optical flow and the VGGT tracking head to ensure robustness. Furthermore, we introduce a hierarchical optimization framework that jointly refines the VO poses and the scale of the VGGT predictions to ensure global scale consistency. Our approach achieves an approximately 5x processing speedup over existing VGGT-based methods, while reducing the average trajectory error by 85% on the indoor EuRoC dataset and by 12% on the outdoor KITTI benchmark. Our code will be publicly available upon acceptance. Project page: https://geneta2580.github.io/HyVGGT-VO.io.
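
The abstract does not state the switching criterion of the hybrid frontend or the exact form of the scale refinement, so the following is only an illustrative sketch under assumptions, not the authors' implementation. It assumes that (a) the frontend falls back to the VGGT tracking head when too few optical-flow features survive, and (b) a single scale factor can be fitted in closed form by least squares between VGGT-predicted depths and the depths of the corresponding sparse VO landmarks. The names `choose_tracker`, `estimate_scale`, and the threshold `min_ratio` are hypothetical.

```python
def choose_tracker(tracked, total, min_ratio=0.5):
    """Pick a tracker from the fraction of features that optical flow kept.

    Assumption: a survival-ratio threshold drives the switch. The abstract
    only says the frontend switches adaptively, not how.
    """
    if total <= 0:
        # No features to judge optical flow by, so defer to the learned tracker.
        return "vggt"
    return "optical_flow" if tracked / total >= min_ratio else "vggt"


def estimate_scale(vo_depths, vggt_depths):
    """Closed-form least-squares scale s minimising sum((s * d_vggt - d_vo)^2).

    Assumption: one global scalar aligns VGGT depths to the VO map. The paper's
    hierarchical optimization refines poses and scale jointly; this one-parameter
    fit only illustrates the scale-alignment idea.
    """
    if len(vo_depths) != len(vggt_depths):
        raise ValueError("depth lists must have the same length")
    numerator = sum(v * p for v, p in zip(vo_depths, vggt_depths))
    denominator = sum(p * p for p in vggt_depths)
    if denominator == 0.0:
        raise ValueError("VGGT depths are all zero; scale is undefined")
    return numerator / denominator
```

For instance, if the VO landmark depths are exactly twice the VGGT depths at the same pixels, `estimate_scale` returns a scale of 2, and multiplying the VGGT depth map by it would bring the dense prediction into the VO frame's scale. In a real system the correspondences would be noisy, so a robust loss or outlier rejection would likely be needed on top of this fit.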
