Visual Question Answering (VQA) has become a key component of user-facing applications, particularly as the generalization capabilities of Vision-Language Models (VLMs) have improved. However, evaluating VLMs against an application's requirements with a standardized framework in practical settings remains challenging. This paper addresses that gap with an end-to-end evaluation framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, enabling comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric built on GPT-4o that achieves a correlation of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice of model a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models such as InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths while offering additional practical advantages. Our framework can also be extended to other tasks.
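For readers unfamiliar with GPT-4o-based judge metrics, the minimal sketch below illustrates the general pattern such a metric builds on: sending the image, question, reference answer, and candidate answer to GPT-4o and parsing its verdict. The prompt wording, the binary grading scale, and the function name `judge_vqa_answer` are illustrative assumptions, not the paper's actual GoEval implementation.

```python
# A minimal sketch of an LLM-as-judge VQA metric in the style of GoEval.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_vqa_answer(image_path: str, question: str,
                     reference: str, candidate: str) -> str:
    """Ask GPT-4o whether a candidate VQA answer matches the reference.

    Returns the model's one-word verdict: 'correct' or 'incorrect'.
    """
    # Encode the image so it can be passed inline as a data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are grading a visual question answering system.\n"
                    f"Question: {question}\n"
                    f"Reference answer: {reference}\n"
                    f"Candidate answer: {candidate}\n"
                    "Looking at the image, reply with exactly one word: "
                    "'correct' or 'incorrect'."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()
```

Verdicts from such a judge can then be aggregated over a labeled set and correlated with human judgments, which is how a figure like the 56.71% correlation reported above would typically be computed.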