Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by using retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and the retrieved context via a dual-stream network and fuse them with a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing absolute relative error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
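The retrieve-then-fuse pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names `cosine_retrieve` and `masked_cross_attention` are hypothetical, the feature vectors and depths are toy arrays, and the actual RAD model uses learned encoders, per-pixel uncertainty estimation, and point-level correspondence matching rather than a dense softmax over all context tokens.

```python
import numpy as np

def cosine_retrieve(query_feat, db_feats, k=2):
    """Return indices of the k most similar RGB-D context samples,
    ranked by cosine similarity of their global feature vectors."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    db = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-8)
    sims = db @ q                     # (N,) similarity to each database entry
    return np.argsort(-sims)[:k]

def masked_cross_attention(query_tokens, ctx_tokens, ctx_depth,
                           low_conf_mask, base_depth):
    """Attend from query tokens to retrieved context tokens and transfer
    their depth values, but only at low-confidence query positions;
    high-confidence positions keep the monocular base prediction."""
    logits = query_tokens @ ctx_tokens.T          # (Nq, Nc) attention logits
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over context tokens
    transferred = attn @ ctx_depth                # (Nq,) attended context depth
    return np.where(low_conf_mask, transferred, base_depth)

# Toy run: retrieve the two nearest context samples, then fuse depth
# only at the one low-confidence query position.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
neighbors = cosine_retrieve(np.array([1.0, 0.0]), db, k=2)

fused = masked_cross_attention(
    query_tokens=np.array([[1.0, 0.0], [0.0, 1.0]]),
    ctx_tokens=np.array([[1.0, 0.0], [0.0, 1.0]]),
    ctx_depth=np.array([2.0, 5.0]),
    low_conf_mask=np.array([True, False]),
    base_depth=np.array([1.0, 1.0]),
)
```

In this sketch, the mask plays the role of the uncertainty-aware gating: positions the monocular branch is already confident about are left untouched, mirroring how the matched cross-attention module transfers geometry only where it is needed and reliable.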