Existing skeleton-based human action classification models rely on well-trimmed, action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models incorporate untrimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for testing videos of any length, realizing action localization and classification simultaneously. Yet achieving such an improvement requires frame-wise annotated skeleton videos, which remain time-consuming to produce in practice. This paper features a novel framework for skeleton-based action segmentation that is trained on short trimmed skeleton videos but can run on longer untrimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a temporal skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions composing a semantic space, which can be sampled to generate multi-action stitched sequences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts, improving action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling the particular data availability. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation, evaluating the effectiveness of the proposed method.
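The Stitch step above can be sketched in a few lines: trimmed clips are treated as elementary motions and sampled to compose a longer multi-action sequence with frame-wise labels. This is a minimal illustrative sketch under assumed data shapes (frames, joints, channels); the function name, parameters, and sampling strategy are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def stitch_sequence(trimmed_clips, labels, rng, num_actions=3):
    """Sample `num_actions` trimmed clips and concatenate them in time.

    trimmed_clips: list of arrays, each of shape (T_i, J, C)
                   (frames, joints, coordinate channels) -- assumed layout
    labels: list of integer class labels, one per clip
    Returns the stitched sequence (sum of T_i, J, C) and its
    frame-wise label vector, which supervises segmentation.
    """
    idx = rng.choice(len(trimmed_clips), size=num_actions, replace=False)
    seq = np.concatenate([trimmed_clips[i] for i in idx], axis=0)
    frame_labels = np.concatenate(
        [np.full(len(trimmed_clips[i]), labels[i]) for i in idx]
    )
    return seq, frame_labels

# Usage: three toy clips of 4, 6, and 5 frames, 25 joints, 3D coordinates.
rng = np.random.default_rng(0)
clips = [rng.normal(size=(t, 25, 3)) for t in (4, 6, 5)]
seq, fl = stitch_sequence(clips, labels=[0, 1, 2], rng=rng, num_actions=3)
```

Each stitched sequence thus carries free frame-wise supervision inherited from the clip-level labels, which is what lets the framework avoid manual frame-wise annotation of untrimmed videos.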