Event-based cameras are bio-inspired sensors that asynchronously capture per-pixel brightness changes. Compared with frame-based sensors, event cameras offer microsecond-level latency and high dynamic range, and therefore show great potential for object detection under high-speed motion and poor illumination. Because event streams are sparse and asynchronous, most existing approaches rely on hand-crafted methods to convert event data into a 2D grid representation. However, such representations are sub-optimal for aggregating information from the event stream for object detection. In this work, we propose to learn an event representation optimized for event-based object detection. Specifically, the event stream is divided into grids along the x-y-t coordinates, separately for positive and negative polarity, producing a set of pillars that form a 3D tensor representation. To fully exploit the information in event streams, we propose a dual-memory aggregation network (DMANet) that leverages both long and short memory along the event stream to aggregate effective information for object detection. Long memory is encoded in the hidden states of adaptive convLSTMs, while short memory is modeled by computing the spatial-temporal correlation between event pillars at neighboring time intervals. Extensive experiments on the recently released event-based automotive detection dataset demonstrate the effectiveness of the proposed method.
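To make the x-y-t gridding step concrete, the following is a minimal sketch of how events might be binned into per-polarity pillars. It is an illustration under stated assumptions, not the paper's implementation: the function name `events_to_pillars`, the `(x, y, t, p)` column layout, the uniform temporal binning, and the use of raw event counts as pillar values are all hypothetical. The representation described above is learned, so the counts here only stand in for whatever features the network actually computes per cell.

```python
import numpy as np


def events_to_pillars(events, height, width, num_bins):
    """Bin events into an x-y-t grid, separately for each polarity.

    Assumed input layout (hypothetical): `events` is an (N, 4) array whose
    columns are x, y, t, p, where p > 0 marks positive polarity.
    Returns a 3D tensor of shape (2 * num_bins, height, width) holding
    event counts per cell; channel index is polarity * num_bins + time_bin.
    """
    grid = np.zeros((2, num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid.reshape(2 * num_bins, height, width)

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = (events[:, 3] > 0).astype(np.int64)  # 0 = negative, 1 = positive

    # Map timestamps onto num_bins uniform intervals over the window.
    t0, t1 = t.min(), t.max()
    span = max(t1 - t0, 1e-9)  # guard against a zero-length window
    b = ((t - t0) / span * num_bins).astype(np.int64)
    b = np.minimum(b, num_bins - 1)  # the latest event falls in the last bin

    # Unbuffered accumulation so repeated (p, b, y, x) indices all count.
    np.add.at(grid, (p, b, y, x), 1.0)
    return grid.reshape(2 * num_bins, height, width)


# Toy usage: four events on a 2x2 sensor, split into two time bins.
toy_events = np.array([
    [0, 0, 0.0, 1],
    [1, 0, 0.5, -1],
    [1, 1, 1.0, 1],
    [1, 1, 1.0, 1],
])
pillars = events_to_pillars(toy_events, height=2, width=2, num_bins=2)
```

Stacking the polarity and time axes into a single channel dimension is one convenient choice for feeding the grid to a 2D convolutional backbone; keeping them as separate axes would serve equally well for a model that processes time bins sequentially.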