Multimodal Action Quality Assessment

Action quality assessment (AQA) aims to assess how well an action is performed. Previous works model actions using only visual information and ignore audio. We argue that although AQA depends heavily on visual information, audio provides useful complementary information for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow and audio, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the information from the modality-specific branches. To bridge the modality-specific branches and the mixed-modality branch, we propose three novel modules. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that an invariant multimodal fusion policy may lead to suboptimal results because it does not take into account the potential diversity across different parts of an action. We therefore propose an Adaptive Fusion Module that learns adaptive multimodal fusion policies for different parts of an action. This module consists of several FusionNets that explore different multimodal fusion strategies and a PolicyNet that decides which FusionNets are enabled. Third, a Cross-modal Feature Decoder module is designed to transfer the cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
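
To make the gating idea behind the Adaptive Fusion Module concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each part of an action is described by one feature vector per modality, that the FusionNets can be stood in for by three fixed functions (mean, element-wise max, element-wise product), and that the PolicyNet is a linear scorer with a hard threshold. The function names, the weights and the rule of summing the enabled outputs are all hypothetical choices made for illustration.

```python
import math

# Hypothetical stand-ins for FusionNets: each maps a list of per-modality
# feature vectors (same length) to one fused vector.
def fuse_mean(feats):
    return [sum(values) / len(values) for values in zip(*feats)]

def fuse_max(feats):
    return [max(values) for values in zip(*feats)]

def fuse_product(feats):
    return [math.prod(values) for values in zip(*feats)]

FUSION_NETS = [fuse_mean, fuse_max, fuse_product]

def policy_net(feats, weights, biases):
    """Assumed PolicyNet: one linear score per FusionNet, enabled if score > 0."""
    x = [value for vector in feats for value in vector]  # concatenate modalities
    decisions = []
    for w, b in zip(weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        decisions.append(logit > 0.0)
    return decisions

def adaptive_fusion(feats, weights, biases):
    """Sum the outputs of the FusionNets that the policy enables for this part."""
    decisions = policy_net(feats, weights, biases)
    fused = [0.0] * len(feats[0])
    for enabled, net in zip(decisions, FUSION_NETS):
        if enabled:
            fused = [a + b for a, b in zip(fused, net(feats))]
    return fused, decisions

# Toy features for one part of an action (values are made up).
rgb, flow, audio = [1.0, 2.0], [3.0, 0.0], [2.0, 4.0]
weights = [
    [1, 0, 0, 0, 0, 0],   # scorer for fuse_mean
    [0, 0, 0, 0, 0, 0],   # scorer for fuse_max
    [0, 0, 0, -1, 0, 0],  # scorer for fuse_product
]
biases = [0.0, -1.0, 0.5]
fused, decisions = adaptive_fusion([rgb, flow, audio], weights, biases)
print(decisions, fused)
```

Because the decisions are computed from the features of each part, two parts of the same action can enable different FusionNets, which is the contrast with an invariant fusion policy. A trainable version would need a differentiable relaxation of the hard threshold; the abstract does not say how that is done, so it is left out of this sketch.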
