Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) we propose two approaches for constructing AD datasets with aligned video data, and use them to build training and evaluation datasets, which will be publicly released; (ii) we develop a Q-former-based architecture that ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) we provide new evaluation metrics for benchmarking AD quality that are well matched to human performance. Taken together, these contributions improve the state of the art on AD generation.