Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow that uses Item Response Theory to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs approximate ground-truth ratings with high alignment, their reasoning is less reliable and actionable than that of human raters. These insights highlight the potential of hybrid evaluation systems that pair VLMs with human oversight, offering a path toward scalable AD quality control.