Self-supervised monocular depth estimation is important for applications such as autonomous driving and robotics. However, self-supervision relies on a strong static-scene assumption, which degrades performance in dynamic scenes, and dynamic scenes are common in real-world settings. To address this, we propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach that transfers a pre-trained image model to self-supervised depth estimation. Training proceeds in two sequential stages: the model is first trained on a dataset composed primarily of static scenes and then extended to more complex datasets that contain dynamic scenes. To support this process, we design compact encoder and decoder adapters that enable parameter-efficient tuning. The adapters not only preserve the general patterns learned by the pre-trained image model but also carry knowledge gained in the first stage over to the second. Extensive experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on the KITTI, CityScapes and DDAD datasets.
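The abstract does not specify the adapter design or which parameters are frozen in each stage, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a residual bottleneck adapter (a common parameter-efficient design, not necessarily the one used in PPEA-Depth) whose up-projection is zero-initialized, so that inserting it leaves the pre-trained features unchanged at the start of tuning. The class `BottleneckAdapter`, the `STAGES` freezing schedule, and all group names are hypothetical and chosen purely for illustration.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adapter computing x + up(relu(down(x)))."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.dim, self.hidden = dim, hidden
        # Down-projection (hidden x dim) with small random weights.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(hidden)]
        # Up-projection (dim x hidden) starts at zero, so the adapter is
        # an identity map before any tuning has happened.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU bottleneck
        delta = matvec(self.up, h)
        return [xi + di for xi, di in zip(x, delta)]     # residual connection

    def num_params(self):
        return 2 * self.dim * self.hidden


# ASSUMED freezing schedule (True = trainable). The abstract only says that
# training has two stages and that adapters carry knowledge across them.
STAGES = {
    "stage1_static": {"encoder": False, "encoder_adapters": True,
                      "decoder": True, "decoder_adapters": False},
    "stage2_dynamic": {"encoder": False, "encoder_adapters": True,
                       "decoder": False, "decoder_adapters": True},
}


def trainable_groups(stage):
    """Return the names of the parameter groups updated in a given stage."""
    return [name for name, flag in STAGES[stage].items() if flag]


adapter = BottleneckAdapter(dim=4, hidden=2)
features = [1.0, -2.0, 0.5, 3.0]
print(adapter(features) == features)       # identity at initialization
print(trainable_groups("stage2_dynamic"))
```

With a bottleneck width well below the feature dimension, each adapter adds `2 * dim * hidden` weights, which is the sense in which such tuning is parameter-efficient compared with updating a full `dim * dim` layer.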