Human Action Recognition (HAR) is an important research area in human-computer interaction, commonly used to monitor the daily activities of elderly and disabled individuals affected by physical and mental health conditions. In recent years, skeleton-based HAR has received considerable attention because skeleton data is robust to changes in illumination, body size, camera viewpoint, and complex backgrounds. A key strength of ST-GCN is its ability to automatically learn spatial and temporal patterns from skeleton sequences. However, its limited receptive field restricts it to short-range correlations, whereas understanding human action often requires modeling long-range dependencies. To address this issue, we propose a spatial-temporal relative transformer (ST-RTR) model. ST-RTR includes joint nodes and relay nodes, which enable efficient communication and data transmission within the network. These nodes help break the inherent spatial and temporal skeleton topologies, allowing the model to capture long-range human actions more effectively. Furthermore, we combine ST-RTR with a fusion model for further performance gains. To assess the performance of ST-RTR, we conducted experiments on three skeleton-based HAR benchmarks: NTU RGB+D 60, NTU RGB+D 120, and UAV-Human. It improved cross-subject (CS) and cross-view (CV) accuracy by 2.11% and 1.45% on NTU RGB+D 60, and by 1.25% and 1.05% on NTU RGB+D 120; on the UAV-Human dataset, accuracy improved by 2.54%. The experimental results demonstrate that the proposed ST-RTR model significantly improves action recognition compared with the standard ST-GCN baseline.
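The relay-node idea can be illustrated with a minimal sketch: a virtual relay token (here simply the mean of all joint features, an assumption for illustration, not the authors' implementation) is appended to the joint sequence before a dense self-attention step, so any two joints can exchange information in at most two hops regardless of their distance in the skeleton topology. All function and variable names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relay_self_attention(joints, d_k=8, seed=0):
    """One self-attention step over skeleton joints plus a virtual relay node.

    joints: (N, D) array of per-joint features. A relay node (mean-pooled
    over all joints, an illustrative assumption) is appended, so information
    can flow between any pair of joints through the relay, sidestepping the
    limited receptive field of graph convolutions.
    """
    rng = np.random.default_rng(seed)
    relay = joints.mean(axis=0, keepdims=True)       # (1, D) global relay token
    x = np.concatenate([joints, relay], axis=0)      # (N+1, D)
    D = x.shape[1]
    # Random projections stand in for learned weights in this sketch.
    Wq, Wk, Wv = (rng.standard_normal((D, d_k)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (N+1, N+1) dense attention
    return attn @ v                                  # updated joint + relay features
```

With 25 joints (as in NTU RGB+D skeletons) and 16-dimensional features, the output has shape (26, d_k): one row per joint plus the relay. A real model would stack such layers separately along the spatial and temporal axes with learned, relative-position-aware weights.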