Cross-Block Fine-Grained Semantic Cascade for Skeleton-Based Sports Action Recognition

Video-based human action recognition has recently attracted growing attention in applications such as video security and sports posture correction. Popular solutions, including graph convolutional networks (GCNs) that model the human skeleton as a spatiotemporal graph, have proven very effective. GCN-based methods with stacked blocks usually rely on top-layer semantics for classification or annotation. Although the global features learned this way are suitable for general classification, they have difficulty capturing fine-grained action changes across adjacent frames, which are decisive factors in sports actions. In this paper, we propose a novel "Cross-block Fine-grained Semantic Cascade (CFSC)" module to overcome this challenge. In summary, the proposed CFSC progressively integrates shallow visual knowledge into high-level blocks so that the network can focus on action details. In particular, the CFSC module combines the GCN feature maps produced at different levels with the aggregated features from preceding levels to consolidate fine-grained features. In addition, a dedicated temporal convolution is applied at each level to learn short-term temporal features, which are carried over from shallow to deep layers to make the most of low-level details. This cross-block feature aggregation mitigates the loss of fine-grained information and improves recognition performance. Finally, FD-7, a new action recognition dataset for fencing sports, was collected and will be made publicly available. Experimental results and empirical analysis on the public FSD-10 benchmark and the self-collected FD-7 dataset demonstrate the advantage of our CFSC module over other methods in learning discriminative patterns for action classification.
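The cascade described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `temporal_conv` and `cascade`, the additive fusion, the depthwise filters, and the assumption that every block outputs features of the same shape (channels, frames, joints) are all hypothetical simplifications chosen for illustration. The abstract specifies only that each level combines its GCN feature map with features aggregated from preceding levels and applies a dedicated temporal convolution.

```python
import numpy as np


def temporal_conv(x, kernel):
    """Depthwise 1-D filtering along the time axis with zero 'same' padding.

    x:      array of shape (C, T, V): channels, frames, joints.
    kernel: array of shape (C, K) with odd K: one short temporal filter
            per channel (cross-correlation, as in deep-learning layers).
    """
    C, T, V = x.shape
    K = kernel.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))  # zero-pad the time axis only
    out = np.zeros(x.shape, dtype=float)
    for k in range(K):
        # Tap k of the filter weights the frame at offset (k - pad).
        out += kernel[:, k, None, None] * xp[:, k:k + T, :]
    return out


def cascade(block_feats, kernels):
    """Carry short-term temporal features from shallow to deep blocks.

    block_feats: list of (C, T, V) arrays, ordered shallow to deep
                 (assumed to share one shape; a simplification).
    kernels:     one (C, K) temporal filter bank per block.
    Returns the aggregated feature after the deepest block.
    """
    carried = np.zeros(block_feats[0].shape, dtype=float)
    for feat, kernel in zip(block_feats, kernels):
        fused = feat + carried                   # hypothetical additive fusion
        carried = temporal_conv(fused, kernel)   # level-specific temporal conv
    return carried


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = [rng.normal(size=(4, 8, 5)) for _ in range(3)]       # 3 blocks
    filters = [rng.normal(size=(4, 3)) * 0.1 for _ in range(3)]  # K = 3
    print(cascade(feats, filters).shape)
```

In a real network the blocks would typically differ in channel width and temporal stride, so the carried features would need a projection or resampling step before fusion. The sketch leaves that out to keep the shallow-to-deep data flow visible.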
