Audio Description is narrated commentary that helps vision-impaired audiences perceive key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an unresolved problem. Existing methods rely solely on frame-level embeddings; they describe object-based content effectively but lack contextual information across scenes. We introduce DANTE-AD, an enhanced video description model that uses a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses frame-level and scene-level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention that achieves contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.
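To make the idea of sequential fusion concrete, the following is a minimal sketch of two cross-attention steps applied one after the other. It is not the DANTE-AD implementation: the abstract does not specify the layer design, so the fusion order (frames first, then scenes), the single attention head, the omission of learned projections, the residual connection, and all names (`cross_attention`, `sequential_fusion`, `queries`, `frames`, `scenes`) are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention with a residual connection.

    queries: list of d-dimensional vectors (the tokens being refined)
    context: list of d-dimensional vectors (the embeddings attended to)
    Learned projections are omitted (identity) to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        # Similarity of this query to every context vector.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Attention-weighted average of the context vectors.
        attended = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        # Residual connection: add the attended context to the query.
        output.append([qi + ai for qi, ai in zip(q, attended)])
    return output


def sequential_fusion(queries, frame_embeddings, scene_embeddings):
    """Attend to frame-level context first, then scene-level context.

    The order is an assumption of this sketch, not taken from the paper.
    """
    after_frames = cross_attention(queries, frame_embeddings)
    return cross_attention(after_frames, scene_embeddings)


def random_vectors(count, dim, rng):
    """Generate `count` random vectors of length `dim` as stand-in embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 8
queries = random_vectors(4, dim, rng)   # hypothetical query tokens
frames = random_vectors(16, dim, rng)   # hypothetical per-frame embeddings
scenes = random_vectors(2, dim, rng)    # hypothetical scene-level embeddings

fused = sequential_fusion(queries, frames, scenes)
print(len(fused), len(fused[0]))  # prints "4 8"
```

In this sketch the output keeps the shape of the queries, so each query token ends up carrying information from both the frame-level and the scene-level embeddings. A real model would add learned query, key, and value projections, multiple heads, and normalization.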