Evaluating text-generative vision-language models is challenging yet crucial. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing new evaluation methodologies, our research seeks to advance the understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets, which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for evaluating model predictions given ground-truth answers, and we base our choice of the final metric on a human evaluation study. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in vision-language modeling.
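To make the follow-up question idea concrete, here is a minimal sketch of how a coarse answer could trigger a follow-up derived from the label hierarchy. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `PARENT` mapping, the function names, the question template, and the exact string matching of answers are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent label hierarchy (illustrative only; a real
# benchmark would derive this from the dataset's own label taxonomy).
PARENT: Dict[str, str] = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
    "tabby cat": "cat",
    "cat": "animal",
}


def ancestors(label: str, parent: Dict[str, str]) -> List[str]:
    """Return the ancestors of `label`, nearest first."""
    chain: List[str] = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


def follow_up_question(
    prediction: str, ground_truth: str, parent: Dict[str, str]
) -> Optional[str]:
    """Return a follow-up question if the prediction is correct but too coarse.

    Assumption: answers are matched by exact (case-insensitive) string
    comparison; a real system would likely need a softer matching step.
    """
    pred = prediction.strip().lower()
    if pred == ground_truth:
        return None  # already at the fine-grained level
    if pred in ancestors(ground_truth, parent):
        # The answer names a correct but coarser category, so ask for detail.
        return f"What type of {pred} is this?"
    return None  # unrelated answer: no follow-up in this sketch


if __name__ == "__main__":
    print(follow_up_question("Dog", "golden retriever", PARENT))
```

In this sketch, an answer of "dog" for an image labeled "golden retriever" yields a follow-up question, while an exact match or an unrelated answer such as "cat" yields none.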