Evaluating text-generative vision-language models is challenging yet crucial. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing new evaluation methodologies, our research seeks to advance the understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets, which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for evaluating model predictions against ground-truth answers, and we select the final metric on the basis of a human evaluation study. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in vision-language modeling.
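The follow-up question idea described above can be illustrated with a minimal sketch. It assumes a toy child-to-parent label hierarchy, exact string matching between the model's answer and a hierarchy node, and a fixed question template; the names `PARENT`, `ancestors`, and `follow_up_question` are hypothetical and are not taken from the paper, whose actual hierarchy, matching rule, and question wording may differ.

```python
from typing import List, Optional

# Toy label hierarchy as a child -> parent mapping.
# Illustrative assumption only; not the label space used in the paper.
PARENT = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}


def ancestors(label: str) -> List[str]:
    """Return the ancestors of `label`, from nearest parent up to the root."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def follow_up_question(ground_truth: str, answer: str) -> Optional[str]:
    """Return a follow-up question if `answer` is a coarser ancestor of
    `ground_truth`; return None if the answer is already exact or is not
    on the path from the ground-truth label to the root.

    Matching is a simple case-insensitive string comparison (an assumption).
    """
    normalized = answer.strip().lower()
    if normalized == ground_truth:
        return None  # already fine-grained, no follow-up needed
    if normalized in ancestors(ground_truth):
        # Hypothetical question template.
        return f"What type of {normalized} is this?"
    return None  # answer is unrelated to the ground-truth category


if __name__ == "__main__":
    print(follow_up_question("golden retriever", "Dog"))
```

In this sketch, a coarse but correct answer such as "dog" for a golden retriever image triggers a more specific question instead of being scored as wrong, while an unrelated answer such as "cat" triggers no follow-up.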