Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition

Skeleton-based human action recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate on full-body movements and often overlook the subtle hand motions that are critical for distinguishing fine-grained actions. Recent work uses a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet these models often blur fine hand details, both because body and hand actions have different characteristics and because subtle features are lost during spatial pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves state-of-the-art (SOTA) accuracy, improving from 86.4% to 93.0% on hand-intensive actions, while requiring fewer GFLOPs and parameters than the relevant unified methods.
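
To make the idea of jointly training two streams with an ensemble loss concrete, the following is a minimal sketch and not the formulation used by BHaRNet, whose exact loss the abstract does not give. It assumes that the body and hand streams each output class logits, that the ensemble prediction is their sum, and that the loss is the cross-entropy of that fused prediction plus optional per-stream terms. The names `ensemble_loss` and `aux_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class under softmax(logits)."""
    return -math.log(softmax(logits)[label])


def ensemble_loss(body_logits, hand_logits, label, aux_weight=0.0):
    """Hypothetical ensemble loss for a two-stream (body + hand) model.

    Assumption: the fused prediction is the element-wise sum of the two
    streams' logits. With aux_weight == 0 only the fused prediction is
    penalized, so either stream may cover for the other on a given class.
    A positive aux_weight also penalizes each stream on its own.
    """
    fused = [b + h for b, h in zip(body_logits, hand_logits)]
    loss = cross_entropy(fused, label)
    if aux_weight > 0.0:
        loss += aux_weight * (
            cross_entropy(body_logits, label)
            + cross_entropy(hand_logits, label)
        )
    return loss


if __name__ == "__main__":
    # Toy example: the body stream favors class 0, the hand stream class 2.
    body = [2.0, 0.5, 0.1]
    hand = [0.1, 0.2, 3.0]
    print(ensemble_loss(body, hand, label=2))
    print(ensemble_loss(body, hand, label=2, aux_weight=1.0))
```

Because the fused term is computed on the combined logits, a stream that is uninformative for a class is not directly penalized as long as the other stream compensates, which is one simple way to read the "cooperative specialization" described above.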
