The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging because the descriptions depend on context and because little training data is available. In this work, we leverage pretrained foundation models, such as GPT and CLIP, and train only a mapping network that bridges the two for visually-conditioned text generation. To obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, and the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets in which visual or contextual information is unavailable, e.g., text-only AD without movies, or visual captioning datasets without context; (iii) we improve the currently available AD datasets by removing label noise in the MAD dataset and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.
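As an illustration of the mapping-network idea, the following is a minimal sketch, not the architecture used in this work. It assumes the mapping is a single linear layer that projects one visual feature vector (such as a CLIP embedding) to a short sequence of prefix vectors in the language model's embedding space. The names `make_mapping`, `map_to_prefix`, and `build_input_sequence`, the dimensions, and the ordering of context before the visual prefix are all hypothetical choices made for the example.

```python
import random


def make_mapping(visual_dim, text_dim, prefix_len, seed=0):
    """Create randomly initialised weights for one linear layer.

    Hypothetical stand-in for a trained mapping network: in practice these
    would be the only trainable parameters, with both foundation models frozen.
    """
    rng = random.Random(seed)
    out_dim = prefix_len * text_dim
    weights = [[rng.gauss(0.0, 0.02) for _ in range(visual_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def map_to_prefix(visual_feature, weights, bias, prefix_len, text_dim):
    """Project a visual feature to `prefix_len` vectors of size `text_dim`."""
    flat = [sum(w * x for w, x in zip(row, visual_feature)) + b
            for row, b in zip(weights, bias)]
    # Split the flat output into one vector per prefix position.
    return [flat[i * text_dim:(i + 1) * text_dim] for i in range(prefix_len)]


def build_input_sequence(context_embeddings, prefix):
    """Concatenate context embeddings (e.g. previous AD, subtitles) with the
    visual prefix. The order used here is an assumption for illustration."""
    return list(context_embeddings) + list(prefix)


# Toy dimensions, chosen only to keep the example small.
VISUAL_DIM, TEXT_DIM, PREFIX_LEN = 8, 4, 3
weights, bias = make_mapping(VISUAL_DIM, TEXT_DIM, PREFIX_LEN)
visual_feature = [1.0] * VISUAL_DIM          # placeholder for a clip feature
prefix = map_to_prefix(visual_feature, weights, bias, PREFIX_LEN, TEXT_DIM)
context = [[0.0] * TEXT_DIM for _ in range(2)]  # placeholder context embeddings
sequence = build_input_sequence(context, prefix)
```

The resulting `sequence` would be fed to the frozen language model as input embeddings, so that text generation is conditioned on both the context and the visual content.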