Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
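The auto-regressive selection step described above can be sketched as a greedy loop over intervals: at each time step, pick the candidate that scores best against the descriptions already chosen. The sketch below is hypothetical; `coherence_score` is a toy stand-in (penalising word overlap with the previous description to discourage repetition), not the paper's actual scoring model.

```python
# Hypothetical sketch of auto-regressive candidate selection.
# candidates[t] holds several candidate descriptions for AD interval t.

def coherence_score(candidate: str, history: list[str]) -> float:
    # Toy proxy for coherence: reward words not seen in the previous
    # description, penalise repeated words. A real system would use a
    # learned model here.
    prev = set(history[-1].split()) if history else set()
    words = set(candidate.split())
    return len(words - prev) - len(words & prev)

def select_sequence(candidates: list[list[str]]) -> list[str]:
    history: list[str] = []
    for cands in candidates:  # one AD interval at a time, in order
        best = max(cands, key=lambda c: coherence_score(c, history))
        history.append(best)
    return history
```

With this toy scorer, a candidate that merely restates the previous interval (e.g. repeating "A man enters the room.") scores poorly, so the loop prefers descriptions that advance the narrative.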