Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information in their daily lives. Unlike traditional visual interpretation tools that provide access through captions and OCR (text recognition from camera input), MLLM-enabled applications support access through conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about these applications' real-world performance and their implications for BLV people's everyday lives remains limited. To address this, we conducted a two-week diary study capturing 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the application's visual interpretations as "somewhat trustworthy" (mean=3.76 out of 5, max=very trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained from responding (10.8%) to follow-up requests. Our work demonstrates that MLLMs can improve the accuracy of descriptive visual interpretations, but that supporting everyday use also depends on the "visual assistant" skill -- a set of behaviors for providing goal-directed, reliable assistance. We conclude by characterizing this skill and proposing practical guidelines to help future MLLM-enabled visual interpretation applications better support BLV people's access to visual information.