SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

Monocular depth estimation is a crucial task that measures distance relative to a camera, which is important for applications such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to limited dynamic range and motion blur. Recent works therefore leverage event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams are spatially sparse, leaving some areas unperceived, especially regions with marginal light changes. As a result, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet) that can estimate depth with fine-grained structure in both daytime and nighttime scenes. Our method consists of two key technical components. First, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused features are then fed back to enhance the frame and event feature learning. Meanwhile, the module uses an output head to generate a fused mask, which is iteratively updated to learn consensual spatial priors. Second, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with fine-grained structure based on the fused features and masks. We evaluate our method on synthetic and real-world datasets; the results show that, even without pretraining, it outperforms prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: https://vlislab22.github.io/SRFNet.
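
To make the idea of reliability-oriented fusion concrete, the following is a minimal, non-learned sketch in plain Python. It is not the AIF module: the actual module is attention-based and learns its consensus regions, whereas this toy version uses fixed rules. The function names (`event_prior`, `frame_prior`, `fuse`), the exposure thresholds, and the choice of a per-pixel weighted average are all assumptions made for illustration only.

```python
EPS = 1e-6  # guards against division by zero when both masks are empty


def event_prior(event_counts):
    """Hypothetical event prior: 1.0 where at least one event fired, else 0.0."""
    return [[1.0 if c > 0 else 0.0 for c in row] for row in event_counts]


def frame_prior(intensity, low=0.05, high=0.95):
    """Hypothetical frame prior: 1.0 where a pixel in [0, 1] is neither
    under-exposed (< low) nor over-exposed (> high), else 0.0."""
    return [[1.0 if low <= v <= high else 0.0 for v in row] for row in intensity]


def fuse(frame_feat, event_feat, frame_mask, event_mask):
    """Per-pixel weighted average of two scalar feature maps.

    The masks act as weights, so each output pixel is dominated by the
    modality that is considered reliable there. Where neither modality is
    reliable, the plain average is used and the fused mask is set to 0.0.
    Returns (fused_feature, fused_mask).
    """
    fused, fused_mask = [], []
    rows = zip(frame_feat, event_feat, frame_mask, event_mask)
    for ff_row, ef_row, fm_row, em_row in rows:
        out_row, mask_row = [], []
        for ff, ef, fm, em in zip(ff_row, ef_row, fm_row, em_row):
            total = fm + em
            if total < EPS:
                # No reliable modality: fall back to an unweighted average.
                out_row.append(0.5 * (ff + ef))
                mask_row.append(0.0)
            else:
                out_row.append((fm * ff + em * ef) / total)
                mask_row.append(max(fm, em))
        fused.append(out_row)
        fused_mask.append(mask_row)
    return fused, fused_mask
```

In this sketch, a pixel that is over-exposed in the frame but has triggered events takes its value entirely from the event feature, and the fused mask records which pixels were covered by at least one modality. A learned variant would replace the fixed thresholds and the weighted average with trainable attention, and would update the mask over several iterations.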
