Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
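The follow-up question idea can be illustrated with a minimal sketch. It assumes a toy child-to-parent label hierarchy and a single question template, both of which are hypothetical and not taken from the paper. The names `PARENT`, `ancestors`, and `follow_up_question` are illustrative only. When a model's answer is a coarser ancestor of the ground-truth category, the sketch generates a question that asks for a more specific answer.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent). A real benchmark would derive
# this from the semantic hierarchy of the dataset's label space.
PARENT: Dict[str, str] = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}


def ancestors(label: str) -> List[str]:
    """Return the ancestors of `label`, from nearest parent up to the root."""
    chain: List[str] = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def follow_up_question(prediction: str, ground_truth: str) -> Optional[str]:
    """Return a follow-up question if the prediction is a coarse ancestor
    of the ground-truth label, otherwise None.

    The question template is an assumption made for illustration.
    """
    pred = prediction.strip().lower()
    if pred in ancestors(ground_truth):
        return f"What type of {pred} is this?"
    return None


if __name__ == "__main__":
    # A coarse but correct answer triggers a follow-up question.
    print(follow_up_question("Dog", "golden retriever"))
    # An exact or unrelated answer does not.
    print(follow_up_question("golden retriever", "golden retriever"))
    print(follow_up_question("cat", "golden retriever"))
```

In this sketch only answers lying on the path from the ground-truth label to the root are treated as coarse. Exact matches and unrelated answers produce no follow-up and would be scored directly by whichever metric is chosen.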

