Video content remains largely inaccessible to blind and low-vision (BLV) users. To address this, we introduce a prototype that leverages a multimodal agent, powered by a novel conversational architecture built on a multimodal large language model (MLLM), to provide BLV users with an interactive, accessible video experience. This Multimodal Agent Video Player (MAVP) demonstrates that an interactive accessibility mode can be added to a video through multilayered prompt orchestration. We describe a user-centered design process comprising 18 sessions with BLV users, which showed that they want not only accessibility features but also independence and personal agency over the viewing experience. A follow-up qualitative study with 8 additional BLV participants found that the MAVP's conversational dialogue gives BLV users a sense of personal agency, fostering collaboration and trust. Even when the system hallucinates, meta-conversational dialogue about the AI's limitations can repair that trust.