Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid motion while the corresponding changes in visual appearance are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving, are usually divided into multiple sub-actions, each of which has a different duration. Existing methods, however, segment the video into clips of a fixed number of frames, which disrupts the temporal continuity of sub-actions and results in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method based on hierarchically pose-guided multi-stage contrastive regression. First, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate the different sub-actions and obtain segmented features. Next, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics-structural priors, guiding the model to learn refined activity similarities and variances. Finally, a multi-stage contrastive regression approach is employed to learn discriminative representations and output the predicted scores. In addition, we introduce the newly annotated FineDiving-Pose dataset, which improves upon the current low-quality human pose labels. Experimental results on the FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
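To illustrate the contrastive-regression idea underlying the final stage (this is a minimal sketch of the general technique, not the authors' implementation): rather than regressing an absolute score, the model predicts the score *difference* between a query video and an exemplar video whose score is known, so the final prediction is the exemplar score plus the predicted delta. The function and regressor names below are hypothetical stand-ins for the learned components.

```python
import numpy as np

def contrastive_score(query_feat, exemplar_feat, exemplar_score, regressor):
    """Predict the query score as exemplar_score + predicted difference.

    `regressor` maps the concatenated pair features to a scalar delta;
    here it is a stand-in for the learned relative-score head.
    """
    pair = np.concatenate([query_feat, exemplar_feat])
    delta = regressor(pair)
    return exemplar_score + delta

# Toy linear head with small fixed weights (learned in practice).
rng = np.random.default_rng(0)
w = rng.normal(size=16) * 0.01
toy_regressor = lambda pair: float(pair @ w)

query_feat = rng.normal(size=8)     # pooled features of the query video
exemplar_feat = rng.normal(size=8)  # pooled features of the exemplar video
score = contrastive_score(query_feat, exemplar_feat,
                          exemplar_score=85.0, regressor=toy_regressor)
```

In practice such a head is trained on pairs of videos with known score differences, and at test time predictions from multiple exemplars can be averaged to stabilize the estimate.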