In real-world scenarios, human actions often fall outside the distribution of the training data, so models must both recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions is challenging because of the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax, an approach that uses a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. The benchmark, code, and models will be released at https://github.com/KPeng9510/OS-SAR.
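To make the open-set setting concrete, the following is a minimal sketch of distance-based rejection over three skeleton modalities (joints, bones, velocities). It is not the CrossMax implementation: the function name `predict_open_set`, the class prototypes, the toy 2-D embeddings, and the fixed threshold are all assumptions introduced purely for illustration. A sample is assigned to the known class whose prototypes are closest on average across modalities, and it is rejected as unknown when even the closest class is farther than the threshold.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_open_set(embeddings, prototypes, threshold):
    """Hypothetical helper, not part of the released code.

    embeddings: {modality: vector} for one sample.
    prototypes: {modality: {class_name: vector}} for the known classes.
    Returns (label, distance), where label is "unknown" if the smallest
    modality-averaged distance exceeds the threshold.
    """
    classes = next(iter(prototypes.values())).keys()
    avg_dist = {}
    for c in classes:
        dists = [euclidean(embeddings[m], prototypes[m][c]) for m in embeddings]
        avg_dist[c] = sum(dists) / len(dists)
    best = min(avg_dist, key=avg_dist.get)
    label = best if avg_dist[best] <= threshold else "unknown"
    return label, avg_dist[best]

# Toy prototypes for two known classes (assumed values).
prototypes = {
    "joint":    {"wave": [0.0, 0.0], "sit": [4.0, 0.0]},
    "bone":     {"wave": [0.0, 0.0], "sit": [0.0, 4.0]},
    "velocity": {"wave": [0.0, 0.0], "sit": [3.0, 4.0]},
}

# A sample close to the "sit" prototypes in every modality.
known = {"joint": [3.0, 0.0], "bone": [0.0, 3.0], "velocity": [3.0, 3.0]}
# A sample far from every known class in every modality.
unknown = {"joint": [10.0, 10.0], "bone": [10.0, 10.0], "velocity": [10.0, 10.0]}

print(predict_open_set(known, prototypes, threshold=2.0))
print(predict_open_set(unknown, prototypes, threshold=2.0))
```

Averaging distances across modalities is only one simple way to combine them; the abstract describes a learned alignment of latent spaces and a logits refinement step, neither of which this sketch attempts to reproduce.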