HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating

Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) the structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS, and DailyDVS-200, outperforming existing methods by 10.29%, 1.71%, and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4×, FLOPs by 300×, and latency by 2.4× compared to heavy baselines, demonstrating its potential for edge deployment.
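To make the idea of FFT-based global token mixing concrete, the following is a minimal NumPy sketch of a spectral gate. It is an illustration under stated assumptions, not the GSG module itself: the function name `global_spectral_gate`, the one-dimensional token layout, and the per-frequency complex gate are all hypothetical choices made for this example, and the abstract does not specify how GSG parameterises or learns its gate.

```python
import numpy as np


def global_spectral_gate(tokens, gate):
    """Mix tokens globally by gating their frequency spectrum.

    tokens: real array of shape (N, C), i.e. N tokens with C channels.
    gate:   complex array of shape (N // 2 + 1, C). In a trained model this
            would be a learnable parameter; here it is a plain array
            (hypothetical parameterisation, not taken from the paper).
    """
    n = tokens.shape[0]
    # The real FFT along the token axis gives every frequency bin a view
    # of all N tokens at once.
    spectrum = np.fft.rfft(tokens, axis=0)
    # Element-wise gating in the frequency domain is equivalent to a
    # circular convolution over the full token sequence.
    gated = spectrum * gate
    # Transform back to the token domain, keeping the original length.
    return np.fft.irfft(gated, n=n, axis=0)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))

# An all-ones gate leaves the tokens unchanged.
identity_gate = np.ones((16 // 2 + 1, 8), dtype=np.complex128)
out = global_spectral_gate(tokens, identity_gate)

# A gate that keeps only the zero-frequency bin replaces every token
# with the per-channel mean over all tokens.
dc_gate = np.zeros((16 // 2 + 1, 8), dtype=np.complex128)
dc_gate[0] = 1.0
mean_out = global_spectral_gate(tokens, dc_gate)
```

In this sketch the gate holds one complex weight per frequency bin and channel, so its size grows with the number of tokens and channels rather than with the square of the token count, which is one plausible reading of why frequency-domain mixing can be parameter-light.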
