We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance, but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: first, a convolutional prior that can be regarded as a conditional positional embedding; second, local and dilated global self-attention for spatial feature interaction; third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection, achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer parameters than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
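To make the second concept more concrete, the following is a minimal sketch of how tokens on a feature map might be grouped for local versus dilated global self-attention. It is not the authors' implementation: the abstract does not specify the partitioning scheme, so the sketch assumes non-overlapping windows for the local case and fixed-stride sampling across the whole map for the dilated case. The function names and the `window` and `grid` parameters are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def local_window_groups(height, width, window):
    """Group token coordinates into non-overlapping window x window blocks.

    Tokens in the same block would attend to each other (local attention).
    Assumes height and width are divisible by window.
    """
    groups = defaultdict(list)
    for y in range(height):
        for x in range(width):
            groups[(y // window, x // window)].append((y, x))
    return dict(groups)


def dilated_grid_groups(height, width, grid):
    """Group tokens sampled at a fixed stride so each group spans the map.

    Each group holds grid x grid tokens spaced height // grid rows and
    width // grid columns apart (an assumed form of dilated attention).
    Assumes height and width are divisible by grid.
    """
    stride_y, stride_x = height // grid, width // grid
    groups = defaultdict(list)
    for y in range(height):
        for x in range(width):
            groups[(y % stride_y, x % stride_x)].append((y, x))
    return dict(groups)


if __name__ == "__main__":
    local = local_window_groups(4, 4, window=2)
    dilated = dilated_grid_groups(4, 4, grid=2)
    print(local[(0, 0)])    # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(dilated[(0, 0)])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

On a 4x4 map, the local group covers one contiguous 2x2 patch, while the dilated group covers the same number of tokens spread over the full map. Under these assumptions, both attention types operate on groups of equal size, and the dilated one mixes information globally.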