Action recognition on edge devices must satisfy stringent constraints on latency, memory, storage, and power consumption. Auxiliary modalities such as skeleton and depth information can improve recognition performance, but they often require additional sensors or computationally expensive pose-estimation pipelines, which limits their practicality on edge devices. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate a strong accuracy-efficiency trade-off. Moreover, deployment-level profiling on the Jetson Orin Nano confirms a smaller on-device footprint and practical resource utilization compared with existing RGB-based action recognition methods.
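As an illustration of the Temporal Shift operation mentioned above, the following is a minimal NumPy sketch of the commonly used formulation, in which a fraction of the channels is shifted one frame forward in time, an equal fraction one frame backward, and the rest left unchanged. It is not the network described in this work: the function name `temporal_shift`, the `(T, C, H, W)` layout, the `fold_div` parameter, and the zero padding at clip boundaries are all assumptions made for this example.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis.

    x: array of shape (T, C, H, W). The layout, name, and fold_div default
    are assumptions for this sketch, not details taken from the paper.
    """
    t, c, h, w = x.shape
    fold = c // fold_div  # number of channels shifted in each direction
    out = np.zeros_like(x)  # boundary frames are zero-padded
    # First `fold` channels: frame t receives features from frame t-1.
    out[1:, :fold] = x[:-1, :fold]
    # Next `fold` channels: frame t receives features from frame t+1.
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Toy clip: 4 frames, 8 channels, 1x1 spatial size.
clip = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
shifted = temporal_shift(clip, fold_div=4)
```

The shift itself adds no parameters and no multiply-accumulate operations, only data movement along the time axis, which is one reason it is attractive under the edge constraints listed above.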