Despite the great success of deep learning in stereo matching, recovering accurate disparity maps remains challenging. Currently, L1 and cross-entropy are the two most widely used losses for training stereo networks. The latter usually performs better than the former, thanks to its probabilistic modeling and its direct supervision of the cost volume. However, how to accurately model the stereo ground truth for the cross-entropy loss remains largely under-explored. Existing works simply assume that the ground-truth distributions are uni-modal, which ignores the fact that most edge pixels can be multi-modal. In this paper, we propose a novel adaptive multi-modal cross-entropy loss (ADL) that guides the network to learn a different distribution pattern for each pixel. Moreover, we optimize the disparity estimator to further alleviate bleeding and misalignment artifacts at inference. Extensive experimental results show that our method is generic and can help classic stereo networks regain state-of-the-art performance. In particular, GANet with our method ranks first on both the KITTI 2015 and KITTI 2012 benchmarks among published methods. Meanwhile, excellent synthetic-to-real generalization performance can be achieved by simply replacing the traditional loss with ours.
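To make the contrast between the two conventional losses concrete, the following is a minimal NumPy sketch, not the ADL loss proposed in the paper. It assumes a per-pixel vector of matching scores over candidate disparities, a soft-argmax disparity estimator for the L1 loss, and a discretized Laplacian as the uni-modal ground-truth distribution for the cross-entropy loss. The function names, the Laplacian shape, and the scale `b` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the disparity dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def unimodal_target(d_gt, max_disp, b=1.0):
    # Assumed uni-modal ground truth: a discretized Laplacian centred on d_gt.
    d = np.arange(max_disp, dtype=np.float64)
    w = np.exp(-np.abs(d - d_gt[..., None]) / b)
    return w / w.sum(axis=-1, keepdims=True)

def l1_loss(scores, d_gt):
    # L1 supervises only the soft-argmax (expected) disparity.
    p = softmax(scores)
    d = np.arange(scores.shape[-1], dtype=np.float64)
    d_hat = (p * d).sum(axis=-1)
    return np.abs(d_hat - d_gt).mean()

def ce_loss(scores, d_gt, b=1.0):
    # Cross-entropy supervises the whole distribution over disparities.
    p = softmax(scores)
    q = unimodal_target(d_gt, scores.shape[-1], b)
    return -(q * np.log(p + 1e-12)).sum(axis=-1).mean()

# One pixel, 64 candidate disparities, ground truth at disparity 20.
d_gt = np.array([20.0])

# Prediction A: two equal peaks at 10 and 30 (their mean is 20).
scores_bimodal = np.full((1, 64), -20.0)
scores_bimodal[0, [10, 30]] = 0.0

# Prediction B: a single peak at the ground truth.
scores_peaked = np.full((1, 64), -20.0)
scores_peaked[0, 20] = 0.0

l1_bimodal = l1_loss(scores_bimodal, d_gt)
l1_peaked = l1_loss(scores_peaked, d_gt)
ce_bimodal = ce_loss(scores_bimodal, d_gt)
ce_peaked = ce_loss(scores_peaked, d_gt)
```

In this toy case the L1 loss is close to zero for both predictions, because the two wrong peaks average to the correct disparity, while the cross-entropy loss penalizes the two-peak prediction more heavily than the single-peak one. This is one way to read the abstract's remark that cross-entropy supervises the cost volume directly.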