Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive, three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using the ImageNette (a 10-class subset of ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (depth-channel concatenation) is superior. For instance, adding the depth channel to BYOL increases downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and from 84.1\% to 87.0\% on ImageNet-C.
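The first approach, an RGB+depth input representation, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `normalize_depth` and `concat_rgbd`, the min-max depth normalization, and the random placeholder arrays are all assumptions made for illustration. In practice the depth map would come from a pretrained monocular depth model rather than from random numbers.

```python
import numpy as np


def normalize_depth(depth, eps=1e-8):
    """Rescale a depth map to [0, 1].

    Min-max normalization is an assumption for this sketch; the
    abstract does not say how the depth signal is scaled.
    """
    depth = depth.astype(np.float32)
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + eps)


def concat_rgbd(rgb, depth):
    """Stack an (H, W, 3) uint8 RGB image and an (H, W) depth map
    into a single (H, W, 4) float32 array (hypothetical helper)."""
    if rgb.shape[:2] != depth.shape:
        raise ValueError("RGB image and depth map must share height and width")
    rgb = rgb.astype(np.float32) / 255.0          # scale RGB to [0, 1]
    depth = normalize_depth(depth)[..., None]     # add a channel axis
    return np.concatenate([rgb, depth], axis=-1)  # depth becomes channel 4


# Placeholder inputs: a random image and a random "predicted" depth map.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
depth = rng.random((224, 224)).astype(np.float32) * 10.0

rgbd = concat_rgbd(rgb, depth)
print(rgbd.shape)
```

With such an input, the encoder's first layer would need to accept four channels instead of three; how the SSL augmentations are applied to the depth channel is a separate design choice that this sketch does not address.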