Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than for web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g., recognizing disability objects less well than other objects); 2) image quality (e.g., not being robust to lighting variation); and 3) text content (e.g., recognizing objects described by tactile adjectives less well than those described by visual ones). We delve deeper with a textual analysis of three common pre-training datasets (LAION-400M, LAION-2B, and DataComp-1B), showing that disability content is rarely mentioned. We then show, through three examples, that these performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg, and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
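For readers unfamiliar with the zero-shot protocol referenced above, here is a minimal sketch using the Hugging Face transformers CLIP API. The checkpoint, label prompts, and image path are illustrative placeholders, not the paper's setup, which spans 25 variants and BLV-captured benchmark images.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint only; the paper evaluates 25 CLIP variants.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder label prompts and image path, not the paper's data.
labels = ["a photo of a white cane", "a photo of a coffee mug"]
image = Image.open("example_blv_photo.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)  # one probability per candidate label
print(labels[probs.argmax().item()])  # predicted class
```

Zero-shot classification here means no task-specific training: the prediction is simply the label prompt whose text embedding is most similar to the image embedding.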
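One plausible way to realize the few-shot mitigation mentioned above is a linear probe on frozen CLIP image embeddings. The sketch below assumes scikit-learn and the same transformers checkpoint; the file paths and labels are hypothetical, and this is one common few-shot recipe rather than necessarily the paper's exact method.

```python
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP image encoder; only the small linear probe is trained.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Return frozen CLIP image embeddings for a list of image paths."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

# Hypothetical few-shot set: a handful of labeled BLV-captured photos per class.
train_paths = ["cane_1.jpg", "cane_2.jpg", "mug_1.jpg", "mug_2.jpg"]
train_labels = [0, 0, 1, 1]  # 0 = white cane, 1 = coffee mug

probe = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print(probe.predict(embed(["query_photo.jpg"])))  # predicted class for a new photo
```

Because the CLIP encoder stays frozen, a probe like this can be fit from as few as 5 images per class, which is the regime the abstract reports as helpful in some scenarios.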