Self-supervised multi-frame monocular depth estimation relies on the geometric consistency between successive frames under the assumption of a static scene. However, moving objects in dynamic scenes inevitably introduce inconsistencies, causing misaligned multi-frame feature matching and misleading self-supervision during training. In this paper, we propose a novel framework called ProDepth, which addresses the mismatch problem caused by dynamic objects using a probabilistic approach. We first estimate the uncertainty associated with the static-scene assumption using an auxiliary decoder. This decoder analyzes inconsistencies embedded in the cost volume and infers the probability that each area is dynamic. We then directly rectify the erroneous cost volume in dynamic areas through a Probabilistic Cost Volume Modulation (PCVM) module. Specifically, we derive probability distributions over depth candidates from both single-frame and multi-frame cues, and modulate the cost volume by adaptively fusing those distributions based on the inferred uncertainty. Additionally, we present a self-supervision loss reweighting strategy that not only masks out incorrect supervision in areas of high uncertainty but also mitigates the risk in the remaining potentially dynamic areas according to their inferred probability. Our method outperforms state-of-the-art approaches on all metrics on both the Cityscapes and KITTI datasets, and demonstrates superior generalization ability on the Waymo Open dataset.
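The two uncertainty-driven steps described above, fusing single-frame and multi-frame depth distributions and reweighting the self-supervision loss, can be illustrated for a single pixel with a minimal sketch. This is not the paper's implementation: the abstract does not give the fusion rule or the loss weights, so the linear blend, the `(1 - u)` weight, the `mask_threshold` value, and all function names below are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution over depth candidates."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_distributions(p_single, p_multi, u):
    """Blend single-frame and multi-frame depth distributions for one pixel.

    u is the inferred probability that the pixel is dynamic (0 = static).
    ASSUMPTION: a linear blend; the abstract only says the distributions
    are fused adaptively based on the inferred uncertainty.
    """
    return [u * ps + (1.0 - u) * pm for ps, pm in zip(p_single, p_multi)]


def reweight_loss(photometric_error, u, mask_threshold=0.8):
    """Mask out highly uncertain pixels and down-weight the rest.

    ASSUMPTION: hard mask at mask_threshold and a (1 - u) weight below it;
    both are hypothetical choices, not values from the paper.
    """
    if u >= mask_threshold:
        return 0.0
    return (1.0 - u) * photometric_error


# Hypothetical matching costs for four depth candidates (lower = better match),
# so the multi-frame distribution is a softmax over the negated costs.
p_multi = softmax([-4.0, -1.0, -3.0, -0.5])
# Hypothetical single-frame scores for the same four candidates.
p_single = softmax([0.2, 2.5, 0.1, 0.3])

fused_static = fuse_distributions(p_single, p_multi, u=0.0)
fused_dynamic = fuse_distributions(p_single, p_multi, u=1.0)
fused_mixed = fuse_distributions(p_single, p_multi, u=0.6)
```

With `u = 0` the fused distribution equals the multi-frame one, and with `u = 1` it falls back entirely on the single-frame cue, which is the behavior one would want where matching is unreliable. Because both inputs sum to one, any blend with `u` in `[0, 1]` is again a valid distribution.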