Audio Description (AD) makes multimedia content accessible to visually impaired audiences by inserting additional narration at suitable intervals to describe visual elements. In this paper, we introduce $\mathrm{CA^3D}$, a unified Context-Aware Automatic Audio Description system that produces AD event scripts with precise temporal locations in long-form cinematic content. $\mathrm{CA^3D}$ consists of: 1) a Temporal Feature Enhancement Module that efficiently captures long-term dependencies, 2) an anchor-based AD event detector with a feature suppression module that localizes AD events and extracts discriminative features for AD generation, and 3) a self-refinement module that uses the generated output to refine AD event boundaries from coarse to fine. Unlike conventional methods, which rely on metadata and ground-truth AD timestamps for AD detection and generation, $\mathrm{CA^3D}$ is the first end-to-end trainable system that uses only visual cues. Extensive experiments demonstrate that $\mathrm{CA^3D}$ improves on existing architectures in both AD event detection and script generation metrics, establishing a new state of the art in AD automation.
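The abstract describes the anchor-based AD event detector only at a high level. As a rough illustration of what anchor-based temporal localization generally involves, the sketch below places candidate segments (anchors) along a timeline, measures overlap with temporal IoU, and prunes overlapping detections with greedy non-maximum suppression. This is a generic technique and not the $\mathrm{CA^3D}$ implementation: the function names, the anchor scales, the stride, and the IoU threshold are all assumptions made for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def make_anchors(duration: float, stride: float, scales: List[float]) -> List[Segment]:
    """Place one anchor per length in `scales`, centred every `stride` seconds.

    Anchors are clipped to [0, duration]. The scales and stride are
    illustrative values, not taken from the paper.
    """
    anchors: List[Segment] = []
    center = stride / 2
    while center < duration:
        for length in scales:
            start = max(0.0, center - length / 2)
            end = min(duration, center + length / 2)
            anchors.append((start, end))
        center += stride
    return anchors


def temporal_nms(segments: List[Segment], scores: List[float],
                 iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy NMS: keep the highest-scoring segments, drop heavy overlaps."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= iou_threshold for k in kept):
            kept.append(i)
    # Return the surviving segments in chronological order.
    return sorted(segments[i] for i in kept)


if __name__ == "__main__":
    # Toy example: three scored candidates, two of which overlap heavily.
    candidates = [(0.0, 4.0), (1.0, 5.0), (10.0, 12.0)]
    confidences = [0.9, 0.8, 0.7]
    print(temporal_nms(candidates, confidences))
```

In a trained detector the scores would come from a classification head and the anchor boundaries would typically be adjusted by a regression head; here they are supplied by hand to keep the example self-contained.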