On the Viability of Monocular Depth Pre-training for Semantic Segmentation

We explore how pre-training a model to infer depth from a single image compares to pre-training it on a semantic task, e.g. ImageNet classification, for the purpose of downstream transfer to semantic segmentation. Whether pre-training on geometric tasks is viable for downstream transfer to semantic tasks matters for two reasons, one practical and one scientific. In practice, if it were viable, one could reduce pre-training costs and the bias introduced by human annotation at scale. If it were not, that would affirm human annotation as an inductive vehicle so powerful as to justify the annotation effort. Yet the bootstrapping question would remain unanswered: how did the ability to assign labels to semantically coherent regions emerge? If pre-training on a geometric task were sufficient to prime a notion of 'object' by leveraging the regularities of the environment (what Gibson called 'detached objects'), that would reduce the gap to semantic inference to a matter of aligning labels, which could be done with few examples. To test these hypotheses, we design multiple controlled experiments that require minimal fine-tuning, using common benchmarks such as KITTI, Cityscapes, and NYU-V2. We vary the form of supervision for depth estimation, the training pipeline, and the data resolution for semantic fine-tuning. We find that depth pre-training outperforms ImageNet pre-training on average by 5.8% mIoU and 5.2% pixel accuracy. Surprisingly, we find that optical flow estimation, which is closely related to depth estimation because it optimizes the same photometric reprojection error, is considerably less effective.
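For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how mIoU and pixel accuracy are commonly computed from a confusion matrix. It is not the evaluation code used in this work: the function names, the flat list-of-labels input, and the choice to skip classes that never appear are all assumptions made for illustration. Benchmark protocols often also exclude a void label, which this sketch does not handle.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix


def pixel_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    return correct / total if total else 0.0


def mean_iou(matrix: List[List[int]]) -> float:
    """Average per-class intersection over union.

    Assumption: classes absent from both prediction and ground truth
    are skipped rather than counted as zero.
    """
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                       # true class c, predicted otherwise
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, true class differs
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
print(f"pixel accuracy = {pixel_accuracy(cm):.3f}, mIoU = {mean_iou(cm):.3f}")
```

On the toy labels, four of six pixels are correct, and the per-class IoUs are 1/3, 2/3, and 1/2. The example shows that the two metrics can diverge, since mIoU weights every class equally regardless of how many pixels it covers.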