Stereo matching plays a crucial role in 3D perception and scene understanding. Despite the proliferation of promising methods, texture-less and texture-repetitive conditions remain challenging because rich geometric and semantic information is lacking. In this paper, we propose a lightweight volume refinement scheme to tackle texture deterioration in practical outdoor scenarios. Specifically, we introduce a depth volume, supervised by the ground-truth depth map, that captures the relative hierarchy of image texture. The disparity discrepancy volume is then filtered hierarchically by depth-aware hierarchy attention and target-aware disparity attention modules. Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation. Furthermore, we propose a more rigorous evaluation metric that considers depth-wise relative error, enabling comprehensive evaluation of general stereo matching and depth estimation models. We extensively validate the proposed methods on public datasets. Results demonstrate that our model achieves state-of-the-art performance and excels particularly on texture-less images. The code is available at https://github.com/ztsrxh/DVANet.
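The abstract does not define the proposed metric, so the following is only an illustrative sketch of why a depth-wise relative error can be stricter than a disparity-space error. It assumes the standard rectified-stereo relation Z = f * B / d. The function names, the focal length, and the baseline are hypothetical and are not taken from the paper or its repository.

```python
from typing import Optional, Sequence


def disparity_to_depth(disparity: float, focal_px: float, baseline_m: float) -> Optional[float]:
    """Convert a disparity in pixels to depth in meters via Z = f * B / d.

    Returns None for non-positive disparities, which have no finite depth.
    """
    if disparity <= 0:
        return None
    return focal_px * baseline_m / disparity


def mean_relative_depth_error(
    pred_disp: Sequence[float],
    gt_disp: Sequence[float],
    focal_px: float,
    baseline_m: float,
) -> float:
    """Mean of |Z_pred - Z_gt| / Z_gt over pixels where both disparities are valid.

    This is an assumed, simplified metric for illustration, not the paper's definition.
    """
    errors = []
    for d_pred, d_gt in zip(pred_disp, gt_disp):
        z_pred = disparity_to_depth(d_pred, focal_px, baseline_m)
        z_gt = disparity_to_depth(d_gt, focal_px, baseline_m)
        if z_pred is None or z_gt is None:
            continue  # skip pixels without a valid depth
        errors.append(abs(z_pred - z_gt) / z_gt)
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical camera: f = 1000 px, B = 0.5 m, so f * B = 500.
    # Both predictions are off by exactly 1 px in disparity.
    near = mean_relative_depth_error([49.0], [50.0], 1000.0, 0.5)  # ground truth at 10 m
    far = mean_relative_depth_error([4.0], [5.0], 1000.0, 0.5)     # ground truth at 100 m
    print(f"near: {near:.4f}, far: {far:.4f}")
```

With these assumed values, the same 1 px disparity error gives a relative depth error of about 2% at 10 m but 25% at 100 m. A disparity end-point error would score the two pixels identically.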