Large-scale vision language models (LVLMs) are language models that can process both image and text inputs within a single model. This paper explores the use of LVLMs to generate review texts for images. The ability of LVLMs to review images is not yet well understood, which calls for a systematic evaluation of their review abilities. Unlike image captions, review texts can be written from various perspectives, such as image composition and exposure. This diversity of perspectives makes it difficult to define a single correct review for an image. To address this challenge, we introduce an evaluation method based on rank correlation analysis: review texts are ranked by humans and by LVLMs, and the correlation between the two rankings is then measured. We further validate this approach by creating a benchmark dataset for assessing the image review ability of recent LVLMs. Our experiments with the dataset reveal that LVLMs, particularly those with proven superiority in other evaluative contexts, excel at distinguishing between high-quality and substandard image reviews.
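The rank-correlation idea can be illustrated with a minimal sketch. The abstract does not name a specific correlation coefficient, so Spearman's rho is used here purely as an assumption; the function name `spearman_rho` and the example ranks are hypothetical and not taken from the paper. The sketch assumes that a human and an LVLM have each ranked the same five review texts for one image, with no ties.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rank correlation for two tie-free rankings of the same items.

    Each ranking must be a permutation of 1..n, where position i holds the
    rank assigned to item i (1 = best).
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    expected = list(range(1, n + 1))
    if sorted(rank_a) != expected or sorted(rank_b) != expected:
        raise ValueError("each ranking must be a permutation of 1..n")
    # Sum of squared rank differences across items.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    # Closed form for Spearman's rho, valid only when there are no ties.
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical ranks for five review texts of one image (illustrative only).
human_ranks = [1, 2, 3, 4, 5]
lvlm_ranks = [2, 1, 3, 5, 4]

rho = spearman_rho(human_ranks, lvlm_ranks)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.80"
```

A value near 1 would indicate that the model orders the review texts much as the human annotator does, while a value near 0 would indicate little agreement. Other coefficients, such as Kendall's tau, could be substituted under the same setup.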