Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential, as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability to dynamic environments. To address this issue, we present D$^3$epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks, providing guidance for subsequent processing. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on the KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at \url{https://github.com/Csyunling/D3epth}.
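The abstract does not specify how spectral entropy is computed inside the uncertainty module; as background, here is a minimal, generic sketch of spectral entropy itself (the Shannon entropy of a signal's normalized power spectrum), which is the quantity the module builds on. The function name and 1-D input are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum of a 1-D signal.

    Note: an illustrative sketch, not the D^3epth module itself.
    Low values indicate power concentrated in few frequencies;
    high values indicate a spread-out (noisy) spectrum.
    """
    power = np.abs(np.fft.rfft(x)) ** 2   # power spectrum via real FFT
    p = power / power.sum()               # normalize to a probability distribution
    p = p[p > 0]                          # drop zero bins to avoid log(0)
    return float(-(p * np.log(p)).sum())
```

A concentrated spectrum (e.g. a constant or a pure sinusoid) yields near-zero entropy, while broadband noise yields a high value; a depth-fusion module can map such a score to a per-region uncertainty weight.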