Transformer-based human skeleton action recognition has been developed for years. However, the complexity and high parameter counts of these models hinder their practical application, especially in resource-constrained environments. In this work, we propose FreqMixFormerV2, built upon the Frequency-aware Mixed Transformer (FreqMixFormer), which identifies subtle and discriminative actions through pioneering frequency-domain analysis. We design a lightweight architecture that maintains robust performance while significantly reducing model complexity. This is achieved through a redesigned frequency operator that optimizes the adjustment of high- and low-frequency components, and a simplified frequency-aware attention module. These improvements yield a substantial reduction in model parameters, enabling efficient deployment with only a minimal sacrifice in accuracy. Comprehensive evaluations on standard benchmarks (NTU RGB+D, NTU RGB+D 120, and NW-UCLA) demonstrate that the proposed model achieves a superior balance between efficiency and accuracy, outperforming state-of-the-art methods with only 60% of the parameters.
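To make the frequency-domain idea concrete, here is a minimal sketch of a frequency operator in the spirit the abstract describes: a skeleton sequence is mapped to the temporal frequency domain with a DCT, low- and high-frequency bands are rescaled by separate coefficients, and the result is transformed back. The function name, the band split, and the scale values are illustrative assumptions, not the paper's actual design.

```python
import numpy as np
from scipy.fft import dct, idct

def frequency_operator(x, low_scale=1.0, high_scale=1.5, split=0.25):
    """Illustrative frequency operator (not the paper's exact formulation).

    x: (T, J, C) skeleton sequence -- T frames, J joints, C coordinates.
    Scales the lowest `split` fraction of temporal DCT coefficients by
    `low_scale` and the remaining (high-frequency) coefficients by
    `high_scale`, then inverts the DCT.
    """
    T = x.shape[0]
    coeffs = dct(x, type=2, norm="ortho", axis=0)  # temporal frequency domain
    cut = max(1, int(T * split))                   # low/high band boundary
    coeffs[:cut] *= low_scale                      # low-frequency band (coarse motion)
    coeffs[cut:] *= high_scale                     # high-frequency band (subtle motion)
    return idct(coeffs, type=2, norm="ortho", axis=0)

# Sanity check: with unit scales the operator reduces to the identity,
# since the orthonormal DCT-II/IDCT-II pair is exactly invertible.
x = np.random.randn(16, 25, 3)  # e.g. 16 frames, 25 joints, 3D coordinates
y = frequency_operator(x, low_scale=1.0, high_scale=1.0)
assert np.allclose(x, y)
```

Boosting the high-frequency band in this way amplifies rapid, small-amplitude joint motion, which is one plausible reading of how frequency-domain analysis helps distinguish subtle actions.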