In frame-based vision, object detection suffers substantial performance degradation under challenging conditions due to the limited sensing capability of conventional cameras. Event cameras output sparse, asynchronous events, offering a potential solution to these problems. However, effectively fusing these two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. Its core component is a coarse-to-fine fusion module, the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part bridges information between the two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conduct extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method surpasses the state-of-the-art by an impressive margin of $\textbf{8.0}\%$ on the DSEC dataset. Moreover, our method exhibits significantly better robustness (\textbf{69.5}\% versus \textbf{38.7}\%) when 15 different corruption types are applied to the frame images. The code is available at https://github.com/HuCaoFighting/FRN.
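To make the channel-level statistic alignment in TAFR concrete, the sketch below shows one plausible realization: normalizing one modality's feature map per channel and re-scaling it to the reference modality's per-channel mean and standard deviation (an AdaIN-style operation). The function name, NCHW layout, and `eps` value are illustrative assumptions; the paper's exact TAFR formulation may differ.

```python
import numpy as np

def align_channel_stats(feat, ref, eps=1e-5):
    """Align per-channel mean/variance of `feat` to those of `ref`.

    feat, ref: arrays of shape (N, C, H, W).
    Illustrative sketch of channel-level statistic alignment;
    not the paper's verbatim TAFR implementation.
    """
    axes = (2, 3)  # spatial dimensions; statistics are per sample, per channel
    mean = feat.mean(axis=axes, keepdims=True)
    std = np.sqrt(feat.var(axis=axes, keepdims=True) + eps)
    ref_mean = ref.mean(axis=axes, keepdims=True)
    ref_std = np.sqrt(ref.var(axis=axes, keepdims=True) + eps)
    # Whiten `feat` per channel, then re-color with the reference statistics.
    return (feat - mean) / std * ref_std + ref_mean
```

After this operation, each channel of the output carries the reference modality's first- and second-order statistics while preserving the input's spatial structure.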