Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Using controlled visual perturbations, we show that existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and unverifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving a coherent global ordering of responses without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to conflicts between visual evidence and textual reasoning.
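As an illustration of the GRPO component, the following minimal sketch shows the group-relative normalization that GRPO-style training commonly applies to rewards within one sampled group. The abstract does not specify the structured reward or the batch-ranking objective, so the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style).

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so an output is credited only for being better
    than its siblings. `eps` (an assumed constant) avoids division by
    zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four judge outputs sampled for the same input.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the advantages are relative within the group, outputs with equal rewards receive equal advantages, and the ordering of advantages follows the ordering of rewards.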