EvConv: Fast CNN Inference on Event Camera Inputs For High-Speed Robot Perception

Event cameras capture visual information with high temporal resolution and wide dynamic range, which allows them to record rapidly changing scenes at fine time granularities (e.g., microseconds). This makes them well suited to high-speed robotics tasks involving rapid motion, such as perception, object tracking, and control. However, convolutional neural network (CNN) inference on event camera streams cannot currently keep up with the rates at which event cameras operate: CNN inference times are typically closer in order of magnitude to the frame intervals of conventional frame-based cameras. Real-time inference at event camera rates is necessary to fully exploit the high temporal resolution that event cameras offer. This paper presents EvConv, a new approach for fast CNN inference on inputs from event cameras. We observe that consecutive inputs to the CNN from an event camera differ only slightly. We therefore propose to perform inference on the difference between consecutive input tensors, which we call the increment. Because increments are very sparse, this significantly reduces the number of floating-point operations required, and thus the inference latency. We design EvConv to exploit the irregular sparsity of increments from event cameras and to retain this sparsity across all layers of the network. We demonstrate a reduction of up to 98% in the number of floating-point operations required in the forward pass. We also demonstrate a speedup of up to 1.6× for CNN inference on tasks such as depth estimation, object recognition, and optical flow estimation, with almost no loss in accuracy.
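The increment idea rests on the linearity of convolution: for a kernel w, conv(x_prev + delta, w) = conv(x_prev, w) + conv(delta, w), so a cached output can be updated by touching only the nonzero entries of delta. The following is a minimal NumPy sketch of that property for a single-channel, single-layer case. It is not the EvConv implementation; the function names, tensor sizes, and sparsity level are illustrative assumptions, and it does not show how sparsity is retained through nonlinear layers.

```python
import numpy as np

def conv2d_dense(x, w):
    """Valid-mode 2D cross-correlation of a single-channel input x with kernel w."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            y += w[i, j] * x[i:i + out_h, j:j + out_w]
    return y

def conv2d_increment(y_prev, delta, w):
    """Update a cached output using only the nonzero entries of delta.

    Returns the updated output and the number of multiply-accumulates used.
    (Hypothetical helper for illustration only.)
    """
    kh, kw = w.shape
    out_h, out_w = y_prev.shape
    y = y_prev.copy()
    macs = 0
    for r, c in zip(*np.nonzero(delta)):
        d = delta[r, c]
        # Input pixel (r, c) contributes to output (r - i, c - j) via kernel tap (i, j).
        for i in range(kh):
            for j in range(kw):
                oi, oj = r - i, c - j
                if 0 <= oi < out_h and 0 <= oj < out_w:
                    y[oi, oj] += w[i, j] * d
                    macs += 1
    return y, macs

# Assumed toy setup: a 64x64 input where 40 pixels change between steps.
rng = np.random.default_rng(0)
x_prev = rng.random((64, 64))
delta = np.zeros((64, 64))
idx = rng.choice(64 * 64, size=40, replace=False)
rows, cols = np.unravel_index(idx, delta.shape)
delta[rows, cols] = rng.standard_normal(40)
x_curr = x_prev + delta
w = rng.standard_normal((3, 3))

y_prev = conv2d_dense(x_prev, w)                  # cached from the previous step
y_inc, macs = conv2d_increment(y_prev, delta, w)  # sparse update
y_ref = conv2d_dense(x_curr, w)                   # full recomputation, for comparison
dense_macs = y_ref.size * w.size

print(np.allclose(y_inc, y_ref), macs, dense_macs)
```

In this toy setting the incremental update needs at most one multiply-accumulate per kernel tap per changed pixel, whereas the dense pass scales with the full output size. Turning such an operation-count reduction into a wall-clock speedup depends on how well the hardware handles irregular sparsity, which this sketch does not address.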
