Vision-language models (VLMs) have shown impressive abilities in text and image understanding. However, existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality, leading to two limitations: 1) an overall score alone gives little indication of which aspects of the text need improvement; 2) a metric may overlook specific evaluation criteria when predicting a single overall score. To address these limitations, we propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four vision-language tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
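The bottom-up aggregation described above can be sketched minimally as follows. This is a hypothetical illustration only: the harmonic mean is assumed here purely because the metric's name suggests it, and the criterion names and score scale are placeholders, not the paper's actual definitions.

```python
from statistics import harmonic_mean

def aggregate_overall(criterion_scores: dict[str, float]) -> float:
    """Combine criterion-wise scores into a single overall score.

    Assumed aggregation: harmonic mean, which penalizes text that is
    weak on any single criterion. The paper defines the actual scheme.
    """
    return harmonic_mean(criterion_scores.values())

# Placeholder criteria and 1-5 scores for illustration.
scores = {"fluency": 5.0, "relevance": 4.0, "descriptiveness": 3.0}
overall = aggregate_overall(scores)
```

Because the harmonic mean is dominated by the smallest value, a text scoring poorly on one criterion cannot be fully compensated by high scores elsewhere, which fits the motivation of not overlooking individual criteria.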