Enhancing Fitness Movement Recognition with Attention Mechanism and Pre-Trained Feature Extractors

Fitness movement recognition, a focused subdomain of human activity recognition (HAR), plays a vital role in health monitoring, rehabilitation, and personalized fitness training by enabling automated exercise classification from video data. However, many existing deep learning approaches rely on computationally intensive 3D models, which limits their feasibility in real-time or resource-constrained settings. In this paper, we present a lightweight and effective framework that pairs pre-trained 2D feature extractors, namely the Convolutional Neural Networks (CNNs) ResNet50 and EfficientNet and the Vision Transformer (ViT), with a Long Short-Term Memory (LSTM) network enhanced by spatial attention. The pre-trained backbones efficiently extract spatial features, the LSTM captures temporal dependencies, and the attention mechanism emphasizes informative segments. We evaluate the framework on a curated subset of the UCF101 dataset, achieving a peak accuracy of 93.34% with the ResNet50-based configuration. In comparative experiments, our approach outperforms several state-of-the-art HAR systems. The proposed method offers a scalable, real-time-capable solution for fitness activity recognition, with broader applications in vision-based health and activity monitoring.
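To make the role of the attention step concrete, the following is a minimal sketch of attention pooling over a sequence of per-frame vectors, such as the outputs an LSTM would produce for each frame of a clip. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact form of the attention, and the names `attention_pool`, `hidden_states`, and `score_vector` are hypothetical. The sketch scores each time step with a dot product, normalizes the scores with a softmax, and returns the weighted average, so that frames with higher scores contribute more to the clip-level representation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    score_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Collapse T per-frame vectors of size D into one clip-level vector.

    hidden_states: T vectors of size D (assumed here to be LSTM outputs).
    score_vector:  hypothetical learned vector of size D used to score frames.
    Returns (weights, pooled): T attention weights and the D-dim weighted sum.
    """
    # One scalar relevance score per time step (dot product with score_vector).
    scores = [
        sum(h_d * v_d for h_d, v_d in zip(h, score_vector))
        for h in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted average of the per-frame vectors, one dimension at a time.
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: three frames with 2-dimensional features (made-up values).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
score_vector = [1.0, 1.0]
weights, pooled = attention_pool(hidden_states, score_vector)
```

In this toy input the second frame has the largest score, so it receives the largest weight and dominates `pooled`. In a trained model the scoring parameters would be learned jointly with the LSTM and the classifier, and the pooled vector would feed the final classification layer.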
