Without ground-truth supervision, self-supervised depth estimation can become trapped in a local minimum because of the gradient-locality issue of the photometric loss. In this paper, we present a framework that improves depth estimation by using semantic segmentation to guide the network out of such local minima. Prior works share encoders between the two tasks or explicitly align them using priors such as the consistency between edges in the depth and segmentation maps. However, these methods usually require ground-truth or high-quality pseudo labels, which may not be readily available in real-world applications. In contrast, we study self-supervised depth estimation together with a segmentation branch supervised by noisy labels from models pre-trained on limited data. We extend parameter sharing from the encoder to the decoder and study how the number of shared decoder parameters affects model performance. We also propose using cross-task information to refine the current depth and segmentation predictions, generating pseudo-depth and semantic labels for training. Extensive experiments on the KITTI benchmark and on a downstream endoscopic tissue deformation tracking task demonstrate the advantages of the proposed method.
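The gradient-locality issue mentioned above can be illustrated with a toy one-dimensional example. The sketch below is not the paper's method or code: it assumes a single global disparity, an L1 photometric loss, and linear interpolation for warping, and every name in it (`bump`, `sample`, `photometric_loss`, `loss_gradient`, `TRUE_DISPARITY`) is hypothetical. It shows that when the warped and target signals do not overlap, the loss is flat, so the gradient carries no information about where the correct disparity lies.

```python
import math

def bump(n, center, half_width):
    # 1-D "image": a triangular intensity bump on a textureless background.
    return [max(0.0, 1.0 - abs(x - center) / half_width) for x in range(n)]

def sample(signal, pos):
    # Linear interpolation with border clamping (1-D analogue of bilinear sampling).
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    frac = pos - lo
    return (1.0 - frac) * signal[lo] + frac * signal[hi]

def photometric_loss(target, source, disparity):
    # Mean L1 difference between the target and the source warped by `disparity`.
    n = len(target)
    return sum(abs(target[x] - sample(source, x + disparity)) for x in range(n)) / n

def loss_gradient(target, source, disparity, eps=1e-3):
    # Central finite difference of the loss with respect to the disparity.
    up = photometric_loss(target, source, disparity + eps)
    down = photometric_loss(target, source, disparity - eps)
    return (up - down) / (2.0 * eps)

# Assumed toy setup: the source is the target shifted by TRUE_DISPARITY pixels.
TRUE_DISPARITY = 10
target = bump(64, 20, 3)
source = bump(64, 20 + TRUE_DISPARITY, 3)

far_gradient = loss_gradient(target, source, 0.0)   # bumps do not overlap
near_gradient = loss_gradient(target, source, 8.0)  # bumps overlap

if __name__ == "__main__":
    print("gradient far from the solution: ", far_gradient)
    print("gradient near the solution:     ", near_gradient)
    print("loss at the true disparity:     ",
          photometric_loss(target, source, float(TRUE_DISPARITY)))
```

In this toy setting, an estimate initialized at disparity 0 receives an essentially zero gradient even though its loss is far from optimal, while an estimate at disparity 8 receives a negative gradient that pushes it toward the true value. An additional signal, such as the semantic guidance described in the abstract, is one way to supply information beyond this local neighborhood.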