SODFormer: Streaming Object Detection with Transformer Using Events and Frames

The DAVIS camera, which streams two complementary sensing modalities (asynchronous events and frames), has gradually been adopted to address major object detection challenges such as fast motion blur and low light. However, effectively leveraging rich temporal cues and fusing two heterogeneous visual streams remains challenging. To address this, we propose a novel streaming object detector with Transformer, namely SODFormer, which is the first to integrate events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) with over 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture that casts object detection as an end-to-end sequence prediction problem, where the novel temporal Transformer module leverages rich temporal cues from the two visual streams to improve detection performance. Finally, we propose an asynchronous attention-based fusion module that integrates the two heterogeneous sensing modalities and exploits the complementary advantages of each. The module can be queried at any time to locate objects, which overcomes the limited output frequency of synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer.
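To make the idea of a fusion module that "can be queried at any time" more concrete, the following is a minimal NumPy sketch. It is not the SODFormer implementation. It assumes that each stream has already been encoded into token features of shape (N, C) at its own timestamps, and that fusion is a per-token softmax weighting over the two modalities. The function names `latest_before` and `fuse`, the scoring vector `w_score`, and all timestamps are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def latest_before(timestamps, features, t_query):
    """Return the most recent feature whose timestamp is <= t_query.

    `timestamps` must be sorted in ascending order.
    """
    idx = np.searchsorted(timestamps, t_query, side="right") - 1
    if idx < 0:
        raise ValueError("no feature available before the query time")
    return features[idx]


def fuse(event_feat, frame_feat, w_score):
    """Attention-weighted fusion of two (N, C) token feature arrays.

    Assumption: a single learned vector `w_score` of shape (C,) scores
    each token of each modality; the scores are normalized across the
    two modalities, so the per-token weights sum to 1.
    """
    stacked = np.stack([event_feat, frame_feat])        # (2, N, C)
    scores = stacked @ w_score                          # (2, N)
    weights = softmax(scores, axis=0)                   # (2, N)
    return (weights[..., None] * stacked).sum(axis=0)   # (N, C)


# Hypothetical streams: event features every 5 ms, frame features every 40 ms.
rng = np.random.default_rng(0)
N, C = 4, 8
event_ts = np.arange(0.0, 100.0, 5.0)
frame_ts = np.arange(0.0, 100.0, 40.0)
event_feats = rng.normal(size=(len(event_ts), N, C))
frame_feats = rng.normal(size=(len(frame_ts), N, C))
w_score = rng.normal(size=C)


def query(t):
    """Fuse the latest available features of both streams at time t."""
    e = latest_before(event_ts, event_feats, t)
    f = latest_before(frame_ts, frame_feats, t)
    return fuse(e, f, w_score)


fused = query(63.0)  # any query time works, not only frame timestamps
```

In this sketch the query time is decoupled from the frame rate: a query at 63 ms pairs the event features from 60 ms with the frame features from 40 ms, so the output rate is not tied to the slower stream.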
