Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper introduces and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assess MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, slows benchmark saturation, and avoids the illusion of emergent abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps; the semantic drift across iterations is quantified with the GC@T metric. Building on GenCeption, we establish the MMECeption benchmark for evaluating Vision LLMs (VLLMs) and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still lag significantly behind human performance and struggle in particular with text-intensive tasks.
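To make the iterative procedure concrete, below is a minimal Python sketch of one GenCeption chain. The `describe`, `generate`, and `embed` callables are hypothetical placeholders for the MLLM under test, a text-to-image model, and an image encoder respectively; the final averaging is one plausible aggregation of the per-iteration similarities, not necessarily the paper's exact GC@T weighting.

```python
import numpy as np

def genception_chain(seed_image, describe, generate, embed, T=5):
    """Run T GenCeption iterations starting from a seed image.

    describe: MLLM under test, image -> textual description (hypothetical interface)
    generate: text-to-image model, description -> image (hypothetical interface)
    embed:    image encoder, image -> 1-D embedding vector (hypothetical interface)
    """
    seed_emb = embed(seed_image)
    current = seed_image
    sims = []
    for t in range(1, T + 1):
        description = describe(current)   # MLLM describes the current image
        current = generate(description)   # text-to-image model re-draws it
        emb = embed(current)
        # Cosine similarity to the ORIGINAL seed embedding: a proxy for how
        # much semantic content survives after t description/generation steps.
        sims.append(float(np.dot(seed_emb, emb) /
                          (np.linalg.norm(seed_emb) * np.linalg.norm(emb))))
    # Assumed aggregation: plain average over the T iterations (labelled GC@T
    # here only for illustration; consult the paper for the exact definition).
    return sum(sims) / T
```

Comparing each generated image against the original seed, rather than against the immediately preceding iteration, ties the score to cumulative semantic drift: a model that hallucinates early in the chain is penalized on every subsequent iteration.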