Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and the scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging conditions, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline that $\textit{specializes}$ the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and employ low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method requires no external priors or manual labels, and it completes the self-calibration process on a $\textbf{single standard GPU within just 5 minutes}$. Each low-rank adapter requires only $\textbf{18MB}$ of storage. We evaluate our method on $\textbf{more than 160 scenes}$ from the Replica, TUM and Waymo Open datasets, achieving up to an $\textbf{88\%}$ performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.
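The low-rank adaptation step central to the pipeline can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the layer sizes, rank, and function names below are hypothetical, chosen only to show why a LoRA adapter (the frozen weight $W$ plus a learnable low-rank update $BA$) is tiny to store, consistent with the 18MB adapter size reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen weight of one pre-trained linear layer (d_out x d_in).
# Sizes and rank are illustrative, not taken from the paper.
d_out, d_in, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in))

# LoRA: learn a low-rank update B @ A instead of modifying W.
# B is initialized to zero, so the adapted layer initially matches
# the pre-trained one exactly; fine-tuning only updates A and B.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x, scale=1.0):
    """Forward pass through the frozen weight plus the low-rank update."""
    return x @ (W + scale * (B @ A)).T

x = rng.standard_normal((2, d_in))
# With B = 0 the adapter is a no-op: outputs equal the frozen model's.
assert np.allclose(adapted_forward(x), x @ W.T)

# Storage cost: the adapter holds only A and B, i.e. rank * (d_in + d_out)
# parameters, far fewer than the d_out * d_in parameters of W itself.
adapter_params = A.size + B.size
print(adapter_params, "adapter params vs", W.size, "frozen params")
```

During self-calibration, only `A` and `B` would receive gradients from the pseudo-labeled data, which is what keeps each per-scene adapter small and cheap to train.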