Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
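The follow-up question idea can be illustrated with a minimal sketch. The abstract does not specify the hierarchy, the trigger condition, or the question wording, so everything below is an assumption made for illustration: the toy `PARENT` mapping, the helper names `ancestors` and `follow_up_question`, the rule that a follow-up is asked only when the model's answer exactly matches an ancestor of the ground-truth label, and the template "What type of ... is this?".

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent hierarchy (assumption, not taken from the paper).
PARENT: Dict[str, Optional[str]] = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
    "animal": None,
}


def ancestors(label: str) -> List[str]:
    """Return the ancestors of `label`, ordered from nearest to most general."""
    chain: List[str] = []
    node = PARENT.get(label)
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def follow_up_question(answer: str, ground_truth: str) -> Optional[str]:
    """Return a follow-up question when `answer` is a coarse (ancestor) version
    of `ground_truth`; return None when no follow-up applies.

    The exact-match trigger and the question template are illustrative
    assumptions, not the benchmark's actual procedure.
    """
    normalized = answer.strip().lower()
    if normalized in ancestors(ground_truth):
        return f"What type of {normalized} is this?"
    return None


if __name__ == "__main__":
    # A coarse answer triggers a follow-up; an exact or unrelated answer does not.
    print(follow_up_question("Dog", "golden retriever"))
    print(follow_up_question("golden retriever", "golden retriever"))
```

In this sketch a coarse but correct answer such as "dog" is not scored as wrong outright; the model instead gets a second, more specific question, and the answer to that question would be the one compared against the fine-grained label.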

