Event-based Action Recognition (EAR) offers the advantages of high-temporal-resolution capture and privacy preservation over traditional action recognition. Current leading EAR solutions typically follow one of two regimes: project unstructured event streams into dense, structured event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse, unstructured event points directly. However, both regimes overlook a fundamental issue: they fail to accommodate the uniquely dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, EventCrab, that adeptly integrates "lighter" frame-specific networks for dense event frames with "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. Specifically, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams, and ii) an Event Point Encoder (EPE) that further explores long-range spatiotemporal features of event points in a Hilbert-scan order. Experiments on four datasets demonstrate the superior performance of the proposed EventCrab, with accuracy gains of 5.17% on SeAct and 7.01% on HARDVS.
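To make the "Hilbert-scan" serialization concrete, the sketch below is a minimal, hypothetical illustration of the idea (not the authors' implementation): sparse event points are ordered by their 2D Hilbert-curve index so that spatially neighboring points stay close in the 1D sequence handed to a point encoder. The function names (`hilbert_index`, `hilbert_scan`), the 128x128 sensor resolution, and the event layout are all assumptions made for this example.

```python
import numpy as np

def hilbert_index(order, x, y):
    """Map integer grid coordinates (x, y) in [0, 2**order) to a 1D
    position along a Hilbert curve of the given order (the classic
    xy-to-d conversion)."""
    d = 0
    s = 1 << (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_scan(events, order=7):
    """Reorder an (N, 4) array of events [x, y, t, polarity] so that
    spatially adjacent points become adjacent in the sequence, with
    timestamps breaking ties at each location."""
    keys = np.array([hilbert_index(order, int(x), int(y))
                     for x, y in events[:, :2]])
    # np.lexsort sorts by the last key first: Hilbert index primary,
    # timestamp secondary.
    perm = np.lexsort((events[:, 2], keys))
    return events[perm]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    events = np.column_stack([
        rng.integers(0, 128, n),   # x on a 128x128 sensor grid
        rng.integers(0, 128, n),   # y
        np.sort(rng.random(n)),    # monotonically increasing timestamps
        rng.integers(0, 2, n),     # polarity (0/1)
    ]).astype(np.float64)
    serialized = hilbert_scan(events, order=7)
    print(serialized.shape)  # (1000, 4): ready for a sequence encoder
```

The appeal of a Hilbert ordering over a plain raster scan is locality preservation: points that are close on the sensor plane tend to remain close in the serialized sequence, which helps a sequence model capture long-range spatiotemporal structure among sparse event points.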