Event cameras detect changes in per-pixel intensity to generate asynchronous `event streams'. They offer great potential for accurate semantic map retrieval in real-time autonomous systems owing to their much higher temporal resolution and high dynamic range (HDR) compared to conventional cameras. However, existing event-based segmentation methods perform sub-optimally because these temporally dense events measure only the varying component of a visual signal, which limits their ability to encode dense spatial context compared to frames. To address this issue, we propose HALSIE, a hybrid end-to-end learning framework that uses three key concepts to reduce inference cost by up to $20\times$ versus prior art while retaining similar performance. First, a simple and efficient cross-domain learning scheme extracts complementary spatio-temporal embeddings from both frames and events. Second, a specially designed dual-encoder scheme with Spiking Neural Network (SNN) and Artificial Neural Network (ANN) branches minimizes latency while retaining cross-domain feature aggregation. Third, a multi-scale cue mixer models rich representations of the fused embeddings. Together, these properties yield a very lightweight architecture that achieves state-of-the-art segmentation performance on the DDD-17, MVSEC, and DSEC-Semantic datasets with up to $33\times$ higher parameter efficiency and a favorable inference cost (17.9 mJ per cycle). Our ablation study also provides new insights into effective design choices that may benefit research on other vision tasks.
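To make concrete the claim that events measure only the varying component of a visual signal, the following is a minimal sketch of a commonly used idealized event-generation model. It is not part of HALSIE. It assumes that a pixel fires when its log intensity changes by at least a contrast threshold, and it approximates the asynchronous sensor by comparing two intensity snapshots. The function name `events_between_frames` and the parameters `threshold` and `eps` are hypothetical, chosen only for illustration.

```python
import math


def events_between_frames(prev, curr, threshold=0.2, eps=1e-3):
    """Return (x, y, polarity) tuples for pixels whose log intensity
    changed by at least `threshold` between two snapshots.

    Assumption: an idealized contrast-threshold model. A real event
    camera fires asynchronously per pixel rather than by differencing
    two frames, and may emit several events for one large change.
    """
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            # eps avoids log(0) for fully dark pixels
            delta = math.log(c + eps) - math.log(p + eps)
            if abs(delta) >= threshold:
                events.append((x, y, 1 if delta > 0 else -1))
    return events


# Two 2x2 intensity snapshots: one pixel brightens, one darkens,
# and two stay constant.
prev_frame = [[0.5, 0.5], [0.5, 0.5]]
curr_frame = [[0.5, 0.9], [0.2, 0.5]]

events = events_between_frames(prev_frame, curr_frame)
print(events)  # [(1, 0, 1), (0, 1, -1)]
```

The two unchanged pixels produce no events at all, which is why a purely event-based input carries little information about static scene content and why complementary frame input can supply the missing spatial context.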