In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning, which involve only superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One possible existing solution is multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach degrades performance on the individual tasks because of the differences between the tasks and the modality gap between linguistic and evaluative information. To address this, we propose a prompt-guided multimodal interaction framework. This framework uses a pair of transformers to facilitate interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thereby enabling interactivity between the tasks. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality, comprehensive action narrations, and we establish benchmarks for NAE. Extensive experimental results show that our method outperforms both separate learning methods and naive multi-task learning methods. Data and code are released at https://github.com/shiyi-zh0408/NAE_CVPR2024.
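The idea of recasting score regression as video-text matching can be sketched as follows. This is an illustrative toy example, not the paper's actual implementation: the binning scheme, prompt wording, and the random stand-in for a text encoder are all assumptions made here for demonstration.

```python
# Hedged sketch: turning score regression into video-text matching.
# A numeric score range is quantized into bins, each bin is verbalized
# as a text prompt, and the score is recovered by matching the video
# embedding against the prompt embeddings. All names are hypothetical.
import numpy as np

def make_score_prompts(bins):
    # One textual prompt per score bin (prompt wording is an assumption).
    return [f"a performance scoring around {b}" for b in bins]

def embed_text(prompts, dim=8, seed=0):
    # Stand-in for a real text encoder (e.g. a CLIP-style model):
    # returns one unit-norm embedding per prompt.
    rng = np.random.default_rng(seed)
    e = rng.normal(size=(len(prompts), dim))
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def match_score(video_emb, prompt_embs, bins):
    # Video-text matching: softmax over cosine similarities, then read
    # the score out as the similarity-weighted average of the bin values.
    sims = prompt_embs @ (video_emb / np.linalg.norm(video_emb))
    w = np.exp(sims) / np.exp(sims).sum()
    return float(w @ np.asarray(bins, dtype=float))

bins = [60, 70, 80, 90, 100]
prompts = make_score_prompts(bins)
prompt_embs = embed_text(prompts)
# Toy video embedding deliberately aligned with the "90" prompt.
video_emb = prompt_embs[3] * 2.0
pred = match_score(video_emb, prompt_embs, bins)
```

Framing regression this way lets the score share the same video-text matching objective as the narration task, which is what allows the two tasks to interact instead of being trained with disjoint heads.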