Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically produce only an overall score for a specific task, such as image captioning. While an overall score is essential for any task, the criteria that should be prioritized differ from task to task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free, comprehensive evaluation metric that aggregates criterion-wise scores into an overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval correlates better with human judgments than conventional metrics while also providing a numerical score for each criterion. Project page: https://stjohn2007.github.io/MMHE_project/
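The abstract does not specify the aggregation function, but the name suggests a harmonic-mean-style combination of per-criterion scores. The following is a minimal sketch of such a bottom-up aggregation, assuming a weighted harmonic mean and hypothetical criterion names; it is an illustration under those assumptions, not the paper's actual implementation.

```python
# Minimal sketch of bottom-up, criterion-wise aggregation (NOT the paper's
# actual method). Assumes a weighted harmonic mean, as the name
# "HarmonicEval" suggests; the criteria and weights below are hypothetical.
from typing import Mapping, Optional


def harmonic_aggregate(criterion_scores: Mapping[str, float],
                       weights: Optional[Mapping[str, float]] = None) -> float:
    """Combine per-criterion scores (e.g., on a 1-5 scale) into a single
    overall score via a weighted harmonic mean.

    The harmonic mean penalizes a low score on any single criterion more
    than an arithmetic mean would, so a response must do reasonably well
    on every criterion to receive a high overall score.
    """
    if weights is None:
        weights = {name: 1.0 for name in criterion_scores}
    total_weight = sum(weights[name] for name in criterion_scores)
    # Weighted harmonic mean: W / sum_i(w_i / s_i), with W = sum_i w_i.
    denominator = sum(weights[name] / score
                      for name, score in criterion_scores.items())
    return total_weight / denominator


# Hypothetical criteria for an image-captioning output, each scored 1-5.
scores = {"fluency": 5.0, "relevance": 4.0,
          "descriptiveness": 3.0, "correctness": 4.0}
print(f"overall: {harmonic_aggregate(scores):.2f}")  # overall: 3.87
```

Note that the harmonic mean yields 3.87 here, below the arithmetic mean of 4.0, because the weak "descriptiveness" score drags the overall result down; this is one plausible reading of the bottom-up design, in which no criterion can be fully compensated by the others.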