We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion using sparse geometric cues. Unlike prior methods, which train task-specific encoders for auxiliary inputs and often overfit and generalize poorly, CAPA freezes the FM backbone and updates only a minimal set of parameters via parameter-efficient fine-tuning (e.g., LoRA or VPT), guided by gradients computed directly from the sparse observations available at inference time. This grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
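The core mechanism described above (frozen backbone, trainable low-rank adapter, gradients from sparse depth at inference time) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the "backbone" is a single frozen linear layer standing in for a ViT depth head, and all names, shapes, and hyperparameters are assumptions.

```python
# Minimal sketch of CAPA-style test-time adaptation (hypothetical names and
# shapes, not the authors' code): a frozen model predicts dense depth, and
# only low-rank LoRA factors are updated, using gradients of a loss computed
# on the sparse depth measurements available at inference time.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the foundation-model weights
        # Low-rank factors: only these are optimized at test time.
        self.A = nn.Parameter(torch.zeros(base.in_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.out_features) * 0.01)

    def forward(self, x):
        return self.base(x) + (x @ self.A) @ self.B


torch.manual_seed(0)
model = LoRALinear(nn.Linear(16, 1))       # stand-in for a ViT-based FM head
w0 = model.base.weight.detach().clone()    # snapshot to verify base stays frozen

x = torch.randn(64, 16)                    # toy per-pixel features
sparse_mask = torch.zeros(64, dtype=torch.bool)
sparse_mask[::8] = True                    # sparse geometric cues (8 of 64 pixels)
sparse_depth = torch.randn(int(sparse_mask.sum()), 1)

init_loss = ((model(x)[sparse_mask] - sparse_depth) ** 2).mean().item()

opt = torch.optim.Adam([model.A, model.B], lr=1e-2)
for _ in range(100):                       # test-time optimization loop
    pred = model(x)                        # dense prediction for the scene
    loss = ((pred[sparse_mask] - sparse_depth) ** 2).mean()
    opt.zero_grad()
    loss.backward()                        # gradients flow only into A and B
    opt.step()

final_loss = ((model(x)[sparse_mask] - sparse_depth) ** 2).mean().item()
```

For the video setting, the same `A` and `B` would be shared across all frames of a sequence, with the sparse losses of all frames summed before the backward pass, which is what gives the joint, temporally consistent adaptation.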