The Audio Description (AD) task aims to generate descriptions of visual elements for visually impaired individuals, helping them access long-form video content such as movies. Given video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience follow the movie's storyline. To achieve this goal, we propose Uni-AD, a simple and unified framework that leverages pre-trained foundation models to generate ADs from an interleaved multimodal sequence. To align features across modalities at a finer granularity, we introduce a simple and lightweight module that maps video features into the textual feature space. We also propose a character-refinement module that provides more precise information by identifying the main characters who play a more significant role in the video context. Building on these designs, we further incorporate contextual information and a contrastive loss into our architecture to generate smoother and more contextual ADs. Experiments on the MAD-eval dataset show that Uni-AD achieves state-of-the-art performance on AD generation, demonstrating the effectiveness of our approach. Code will be available at https://github.com/MCG-NJU/Uni-AD.