Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information. Unlike traditional visual interpretation tools that only provide descriptions, MLLM-enabled applications offer conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about these applications' real-world performance and their implications for BLV people's daily lives remains limited. To address this gap, we conducted a two-week diary study that captured 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the application's visual interpretations as "trustworthy" (mean=3.76 out of 5, max=extremely trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers to users' requests (22.2%) or abstained from responding (10.8%). Our findings show that while MLLMs can improve the descriptive accuracy of visual interpretations, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance. We conclude by proposing this "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people's access to visual information.