VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality: the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
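To make the interval construction concrete, the following is a minimal sketch of split conformal prediction applied to a judge's score. The abstract does not specify the point estimator or the nonconformity score, so this sketch assumes a probability-weighted mean over the score tokens and an absolute-residual nonconformity score; the 1-5 score scale, all function names, and the calibration values in the usage lines are hypothetical and are not taken from the linked repository.

```python
import math


def expected_score(logprobs):
    """Probability-weighted mean score from score-token log-probabilities.

    `logprobs` maps each candidate score (e.g. 1..5) to its log-probability.
    Probabilities are renormalised over the candidates that are provided.
    """
    m = max(logprobs.values())  # subtract the max for numerical stability
    weights = {s: math.exp(lp - m) for s, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def conformal_quantile(residuals, alpha):
    """Finite-sample split-conformal quantile of calibration residuals.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual, or
    infinity when the calibration set is too small for the requested level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def predict_interval(point, q, lo=1.0, hi=5.0):
    """Symmetric interval around the point score, clipped to the score scale."""
    return max(lo, point - q), min(hi, point + q)


def relative_width(interval, lo=1.0, hi=5.0):
    """Fraction of the score range that the interval covers."""
    return (interval[1] - interval[0]) / (hi - lo)


# Usage with made-up calibration data (assumption: 1-5 scale).
# Each pair is (judge point score, human reference score).
calibration = [(3.2, 3.0), (4.1, 5.0), (2.5, 2.0), (4.8, 4.0), (1.9, 2.0)]
residuals = [abs(pred - human) for pred, human in calibration]
q = conformal_quantile(residuals, alpha=0.2)

test_logprobs = {1: -4.0, 2: -2.5, 3: -0.9, 4: -0.8, 5: -3.0}
point = expected_score(test_logprobs)
interval = predict_interval(point, q)
print(point, interval, relative_width(interval))
```

Under the usual exchangeability assumption between calibration and test examples, an interval built this way contains the human score with probability at least 1 - alpha. The relative width is the kind of quantity the ~40% and ~70% figures describe: a judge can rank responses well and still produce intervals that span most of the score range.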

