Despite the great success of deep learning in stereo matching, recovering an accurate, clearly contoured disparity map remains challenging. Currently, L1 loss and cross-entropy loss are the two most widely used loss functions for training stereo matching networks. Compared with the former, the latter usually achieves better results thanks to its direct constraint on the cost volume. However, how to generate a reasonable ground-truth distribution for this loss function remains largely underexplored. Existing works assume a uni-modal distribution around the ground truth for every pixel, ignoring the fact that edge pixels may have multi-modal distributions. In this paper, we first experimentally demonstrate the importance of correct edge supervision to overall disparity accuracy. We then propose a novel adaptive multi-modal cross-entropy loss that encourages the network to generate different distribution patterns for edge and non-edge pixels. We further optimize the disparity estimator in the inference stage to alleviate bleeding and misalignment artifacts at edges. Our method is generic and can help classic stereo matching models regain competitive performance. GANet trained with our loss ranks 1st on the KITTI 2015 and 2012 benchmarks and outperforms state-of-the-art methods by a large margin. Our method also exhibits superior cross-domain generalization ability and outperforms existing generalization-specialized methods on four popular real-world datasets.
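The difference between uni-modal and multi-modal supervision can be illustrated with a minimal sketch. It is not the paper's implementation: the Laplacian mode shape, the `scale` parameter, the mode centers and weights, and all function names are assumptions chosen for illustration, since the abstract does not specify how the ground-truth distribution is built. The sketch constructs a target distribution over disparity candidates for one pixel, evaluates the cross-entropy against a softmax over that pixel's cost-volume scores, and shows why taking the plain expectation of a bimodal distribution lands between the two surfaces.

```python
import math


def laplacian_mode(center, num_disp, scale=1.0):
    """Discretized Laplacian over disparities 0..num_disp-1, normalized to sum to 1.

    The Laplacian shape and `scale` are illustrative assumptions.
    """
    weights = [math.exp(-abs(d - center) / scale) for d in range(num_disp)]
    total = sum(weights)
    return [w / total for w in weights]


def target_distribution(modes, num_disp, scale=1.0):
    """Mixture of modes; `modes` is a list of (center, weight) pairs, weights summing to 1.

    One pair gives a uni-modal target (non-edge pixel); several pairs give a
    multi-modal target (edge pixel). The weights here are hypothetical.
    """
    target = [0.0] * num_disp
    for center, weight in modes:
        for d, p in enumerate(laplacian_mode(center, num_disp, scale)):
            target[d] += weight * p
    return target


def cross_entropy(target, logits):
    """Cross-entropy between `target` and softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


def expected_disparity(dist):
    """Mean of the distribution over disparity indices (a soft-argmax-style estimate)."""
    return sum(d * p for d, p in enumerate(dist))


NUM_DISP = 48  # hypothetical number of disparity candidates

# Non-edge pixel: a single mode at the ground-truth disparity.
uni = target_distribution([(10.0, 1.0)], NUM_DISP)

# Edge pixel: foreground at disparity 10, background at 30 (hypothetical weights).
bi = target_distribution([(10.0, 0.7), (30.0, 0.3)], NUM_DISP)

# Loss of a network that predicts a flat distribution vs. one that matches the target.
flat_loss = cross_entropy(bi, [0.0] * NUM_DISP)
matched_loss = cross_entropy(bi, [math.log(p) for p in bi])

# The mean of the bimodal distribution falls between the two surfaces.
mean_bi = expected_disparity(bi)
```

In this sketch `matched_loss` is lower than `flat_loss`, and `mean_bi` lies between 10 and 30, on neither surface. That intermediate value is the kind of edge bleeding that motivates a different disparity estimator at inference time.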