In this paper, we introduce a novel training method that enables any monocular depth network to learn absolute scale and estimate metric road-scene depth from regular training data alone, i.e., driving videos. We refer to this training framework as FUMET. The key idea is to leverage cars found on the road as sources of scale supervision and to incorporate them into network training robustly. FUMET detects and estimates the sizes of cars in a frame and aggregates the scale information extracted from them into an estimate of the camera height, whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any, otherwise scale-oblivious, monocular depth network so that it becomes not only scale-aware but also metric-accurate, without the need for auxiliary sensors or extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of FUMET, which achieves state-of-the-art accuracy. We also show that FUMET enables training on mixed datasets with different camera heights, which leads to larger-scale training and better generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and FUMET democratizes its deployment by establishing a means to convert any model into a metric depth estimator.