Generalized metric depth understanding is critical for precise vision-guided robotics, yet current state-of-the-art (SOTA) vision encoders do not support it. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter, which incorporates metric depth and aligns it into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust feature extraction that is invariant to depth density and distribution. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion, without the need for finetuning. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation and outperform SOTA depth-aware and multi-modal encoders in our experiments. When no depth is available, the depth adapter can be activated with an empty depth map, single-pixel depth cues, or monocular depth estimates, so that depth-aware feature extraction can still be included in subsequent downstream tasks.
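To make the idea of a sinusoidal depth encoding concrete, the following is a minimal sketch and not the encoding used in this work, whose exact form is not given in the abstract. It assumes a transformer-style scheme in which each metric depth value is normalized by a hypothetical maximum range and mapped to sine and cosine features at geometrically spaced frequencies, with a trailing validity flag. The function names, `num_freqs`, `max_depth_m`, and the convention that a depth of zero marks a missing measurement are all illustrative assumptions.

```python
import math


def encode_depth_value(depth_m, num_freqs=8, max_depth_m=10.0):
    """Map one metric depth value (meters) to 2 * num_freqs + 1 features.

    Assumption: missing depth (None, non-finite, or <= 0) is encoded as an
    all-zero vector, so the final validity flag is 0 for such pixels.
    """
    valid = depth_m is not None and math.isfinite(depth_m) and depth_m > 0.0
    if not valid:
        return [0.0] * (2 * num_freqs + 1)

    # Normalize to [0, 1]; values beyond the assumed range are clipped.
    d = min(depth_m, max_depth_m) / max_depth_m

    feats = []
    for k in range(num_freqs):
        # Frequencies double at each step: pi, 2*pi, 4*pi, ...
        angle = (2.0 ** k) * math.pi * d
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    feats.append(1.0)  # validity flag
    return feats


def encode_depth_map(depth_map, num_freqs=8, max_depth_m=10.0):
    """Encode a 2D list of depth values pixel by pixel."""
    return [
        [encode_depth_value(d, num_freqs, max_depth_m) for d in row]
        for row in depth_map
    ]


# An "empty" depth map and a map with a single known pixel.
empty_map = [[0.0] * 4 for _ in range(4)]
sparse_map = [[0.0] * 4 for _ in range(4)]
sparse_map[1][2] = 2.5  # one depth measurement in meters

empty_enc = encode_depth_map(empty_map)
sparse_enc = encode_depth_map(sparse_map)
```

Because each pixel is encoded independently and missing pixels reduce to zeros, the same function handles dense maps, sparse maps, and the empty map, which is one plausible way to keep the input format independent of depth density under the assumptions above.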