Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting the subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark demonstrate consistent improvements and robustness under noisy and heterogeneous conditions.
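The Noisy-OR combination underlying the dual-stream fusion can be sketched in a few lines. This is an illustrative sketch only: the abstract does not specify how the framework weights stream reliabilities, so the example below assumes a plain Noisy-OR rule over per-stream class probabilities (a class is activated if any stream confidently supports it), followed by renormalization.

```python
import numpy as np

def noisy_or_fusion(stream_probs):
    """Fuse per-stream class probabilities with a Noisy-OR rule:
    fused_c = 1 - prod_s (1 - p_{s,c}), so any single confident
    stream can raise a class's fused score. Illustrative only; the
    paper's reliability-aware weighting is not reproduced here."""
    probs = np.asarray(stream_probs)            # (num_streams, num_classes)
    fused = 1.0 - np.prod(1.0 - probs, axis=0)  # Noisy-OR per class
    return fused / fused.sum()                  # renormalize to a distribution

# Hypothetical example: a body stream confident in class 0,
# a hand stream confident in class 2.
body = np.array([0.7, 0.2, 0.1])
hand = np.array([0.1, 0.2, 0.7])
fused = noisy_or_fusion([body, hand])
```

Because Noisy-OR treats the streams as independent causes, classes 0 and 2 receive equally high fused probability here, while class 1 (weakly supported by both streams) is suppressed; no explicit confidence labels are needed for this combination.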