Audio description (AD) makes video content accessible to blind and low-vision (BLV) audiences, but producing high-quality descriptions is resource-intensive. Automated AD offers scalability, and prior studies show that human-in-the-loop editing and user queries effectively improve narration. We introduce ADx3, a novel framework that integrates all three in complementary modules: GenAD, which upgrades baseline description generation with modern vision-language models (VLMs) guided by accessibility-informed prompting; RefineAD, which lets BLV and sighted users review and edit drafts through an inclusive interface; and AdaptAD, which supports on-demand user queries. We evaluated GenAD in a study in which seven accessibility specialists reviewed VLM-generated descriptions against professional guidelines. The findings show that, with tailored prompting, VLMs produce good descriptions that meet basic standards, but excellent descriptions require human edits (RefineAD) and interaction (AdaptAD). ADx3 demonstrates a collaborative workflow for accessible content creation in which the components reinforce one another and enable continuous improvement: edits guide future baselines, and user queries reveal gaps in both AI-generated and human-authored descriptions.