As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be extended with further assistive functions for helping PVIs. By integrating multi-modal information, our framework exhibits outstanding performance across tasks and offers PVIs more comprehensive assistance. Extensive experiments demonstrate the effectiveness and generalizability of our framework.
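A minimal sketch of what such a generalist interface could look like, assuming a single shared backbone with per-task decoders. The class names, task keys, and dispatch pattern below are hypothetical illustrations, not the authors' released @Model API.

```python
# Hypothetical sketch: one generalist model serving all five @Bench tasks
# behind a single call. Names and structure are illustrative assumptions.

from dataclasses import dataclass
from typing import Any, Optional

import numpy as np

# The five vision-language tasks selected by the pre-design user study.
TASKS = (
    "panoptic_segmentation",
    "depth_estimation",
    "ocr",
    "image_captioning",
    "vqa",
)

@dataclass
class ATRequest:
    task: str                       # one of TASKS
    image: np.ndarray               # RGB frame, e.g. shape (H, W, 3)
    question: Optional[str] = None  # free-form text, used only for VQA

class UnifiedATModel:
    """Hypothetical generalist: one shared encoder, per-task decoders."""

    def infer(self, req: ATRequest) -> Any:
        if req.task not in TASKS:
            raise ValueError(f"unsupported task: {req.task!r}")
        if req.task == "vqa" and req.question is None:
            raise ValueError("VQA requests need a question")
        # In a real system, a shared vision-language backbone would run
        # once per frame; a lightweight task head would then decode masks,
        # a depth map, recognized text, a caption, or an answer.
        raise NotImplementedError("placeholder: plug in a trained model")

# Usage: an assistive client could reuse one camera frame across tasks.
# model = UnifiedATModel()
# frame = np.zeros((480, 640, 3), dtype=np.uint8)
# model.infer(ATRequest(task="image_captioning", image=frame))
```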