Skeleton-based action recognition has gained considerable traction because skeletal representations are succinct and robust. Nonetheless, current methods often rely on a single backbone to model the skeleton modality, so their performance can be limited by the inherent flaws of that backbone. To address this and fully leverage the complementary characteristics of different network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition. HDBN benefits from the proficiency of graph convolutional networks (GCNs) in handling graph-structured data and from the powerful ability of Transformers to model global information. In detail, HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches use GCNs and Transformers, respectively, to model both 2D and 3D skeletal modalities. HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset and outperforming most existing methods. Our code will be publicly available at: https://github.com/liujf69/ICMEW2024-Track10.
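The abstract does not state how the outputs of the two branches are combined. As a minimal sketch, assuming a simple late fusion in which each branch emits per-class logits that are converted to probabilities and averaged with fixed weights, the dual-branch idea could look like the following. The function names, the toy logits, and the weights are all hypothetical and are not taken from the authors' code.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(
    branch_logits: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted sum of per-branch class probabilities.

    ASSUMPTION: weighted late fusion is an illustrative choice only;
    the abstract does not specify the fusion rule used by HDBN.
    """
    num_classes = len(next(iter(branch_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in branch_logits.items():
        if len(logits) != num_classes:
            raise ValueError(f"branch {name!r} has a different class count")
        probs = softmax(logits)
        for c in range(num_classes):
            fused[c] += weights[name] * probs[c]
    return fused


def predict(
    branch_logits: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> int:
    """Return the index of the class with the highest fused score."""
    fused = fuse_branches(branch_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy 3-class example with made-up logits for one skeleton sequence.
toy_logits = {
    "mixgcn": [2.0, 1.0, 0.1],     # hypothetical GCN-branch output
    "mixformer": [0.5, 2.5, 0.3],  # hypothetical Transformer-branch output
}
equal_weights = {"mixgcn": 0.5, "mixformer": 0.5}
gcn_heavy_weights = {"mixgcn": 0.9, "mixformer": 0.1}

label_equal = predict(toy_logits, equal_weights)
label_gcn_heavy = predict(toy_logits, gcn_heavy_weights)
```

In this toy case the two branches disagree, so the predicted class depends on the fusion weights. That is the practical appeal of combining complementary backbones: neither branch alone determines the result.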