Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal care items, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues (such as blur, misframing, and rotation) affect the accuracy of VLM-generated captions and whether the resulting captions meet BLV people's information needs. Based on a survey of 86 BLV participants, we develop an annotated dataset of 1,859 product images from BLV people to systematically evaluate how image quality issues affect VLM-generated captions. While the best VLM achieves 98% accuracy on images with no quality issues, accuracy drops to 75% overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.