This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and the key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on these models for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations in cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucination. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.