Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models are strong at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each paired with five critic review texts and human-annotated rankings in both English and Japanese, totaling over 2,000 data instances. The datasets are available at https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings expose the limitations of current evaluation methods and underscore the need for approaches that better capture human reasoning in Vision & Language tasks.