Action quality assessment (AQA) aims to evaluate how well an action is performed. Previous works model actions using only visual information, ignoring audio. We argue that although AQA depends primarily on visual information, audio provides useful complementary cues for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow, and audio, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the modality-specific information from these branches. To build a bridge between the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that a fixed multimodal fusion policy may lead to suboptimal results because it ignores the potential diversity across different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies for different parts of an action. This module consists of several FusionNets that explore different multimodal fusion strategies and a PolicyNet that decides which FusionNets are enabled. Third, a Cross-modal Feature Decoder module is designed to transfer the cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
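To make the Adaptive Fusion Module concrete, the following is a minimal sketch of the gating idea: several FusionNets (here, simple linear maps) each propose a fused feature, and a PolicyNet scores them to decide which are enabled. All dimensions, the linear parameterization, and the hard thresholding are illustrative assumptions, not the paper's actual implementation (which is a trained network).

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # per-modality feature dimension (hypothetical)
M = 3   # modalities: RGB, optical flow, audio
K = 4   # number of FusionNets (hypothetical)

# Each FusionNet: a linear map from concatenated modality features to a fused feature.
# Real FusionNets would be small learned networks with different fusion strategies.
fusion_weights = [rng.standard_normal((M * D, D)) * 0.1 for _ in range(K)]

# PolicyNet: produces one score per FusionNet; here a positive score enables it
# (hard thresholding stands in for the learned gating policy).
policy_weights = rng.standard_normal((M * D, K)) * 0.1

def adaptive_fusion(rgb, flow, audio):
    """Fuse modality features using only the FusionNets the policy enables."""
    x = np.concatenate([rgb, flow, audio])        # (M*D,)
    scores = x @ policy_weights                   # (K,) one score per FusionNet
    gates = (scores > 0).astype(float)            # binary enable/disable decisions
    if gates.sum() == 0:                          # keep at least one FusionNet active
        gates[np.argmax(scores)] = 1.0
    fused = sum(g * (x @ W) for g, W in zip(gates, fusion_weights))
    return fused / gates.sum()                    # average over enabled FusionNets

fused = adaptive_fusion(rng.standard_normal(D),
                        rng.standard_normal(D),
                        rng.standard_normal(D))
print(fused.shape)
```

Because the gates depend on the input features, different parts of an action can activate different subsets of FusionNets, which is the adaptivity the module is designed for.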