Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs, where spatial and temporal features are extracted independently, leading to unbalanced spatial-temporal information that hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details, resulting in the loss of crucial global context, which further complicates the task of differentiating ambiguous actions. To address these challenges, we propose a lightweight plug-and-play module called Synchronized and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with their original spatial-temporal features. This aggregation step effectively combines both global context and local details. Experimental results on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets demonstrate significant improvements in distinguishing ambiguous actions. Our code will be made available at https://github.com/HaoHuang2003/SFHead.
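To make the pipeline concrete, here is a minimal NumPy sketch (not the authors' implementation; all function names and the pooling-based fusion are illustrative assumptions) of the serial GCN-to-TCN block with an SF-Head-style step in between that extracts spatial and temporal summaries at the same point in the pipeline and fuses them back into the features:

```python
import numpy as np

# Hypothetical sketch of the SF-Head placement, not the authors' code.
# A skeleton feature tensor X has shape (C, T, V):
# channels, frames, joints.

rng = np.random.default_rng(0)
C, T, V = 4, 8, 5

A = rng.random((V, V))
A /= A.sum(axis=1, keepdims=True)          # row-normalized joint adjacency
X = rng.standard_normal((C, T, V))

def gcn(x, adj):
    # Spatial graph convolution: mix the joint dimension via the adjacency.
    return np.einsum('ctv,vw->ctw', x, adj)

def tcn(x, k=3):
    # Temporal convolution stand-in: moving average over frames.
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)), mode='edge')
    return np.stack([xp[:, t:t + k].mean(axis=1)
                     for t in range(x.shape[1])], axis=1)

def sf_head(x):
    # Synchronized extraction (toy version): pool over joints for a
    # temporal summary and over frames for a spatial summary, then
    # broadcast-add both back so global context rejoins local detail.
    temporal = x.mean(axis=2, keepdims=True)   # (C, T, 1)
    spatial = x.mean(axis=1, keepdims=True)    # (C, 1, V)
    return x + temporal + spatial

out = tcn(sf_head(gcn(X, A)))
print(out.shape)  # (4, 8, 5): output shape is unchanged
```

Because the head leaves the tensor shape untouched, it can be dropped between any GCN and TCN layer of an existing backbone, which is the sense in which the module is plug-and-play.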