GPT-Vision has impressed us on a range of vision-language tasks, but it comes with the familiar new challenge: we have little idea of its capabilities and limitations. In our study, we formalize a process that many have already been attempting instinctively: developing "grounded intuition" of this new model. Inspired by the recent movement away from benchmarking in favor of example-driven qualitative evaluation, we draw upon grounded theory and thematic analysis in social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing. We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible.