Automatically evaluating vision-language tasks is challenging, particularly when the goal is to reflect human judgments, because automatic methods are limited in accounting for fine-grained details. Although GPT-4V has shown promising results on various multi-modal tasks, its use as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabilities for evaluation, covering tasks that range from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment. We employ two evaluation methods with GPT-4V: single-answer grading and pairwise comparison. Notably, GPT-4V shows promising agreement with humans across tasks and evaluation methods, demonstrating the considerable potential of multi-modal LLMs as evaluators. Despite limitations in visual clarity grading and in complex real-world reasoning, its ability to provide human-aligned scores enriched with detailed explanations makes it a promising candidate for a universal automatic evaluator.