On the Adversarial Robustness of Multimodal LLM Judges

Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, which threatens the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when they function as judges. It covers diverse attacks against popular judging approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods are hampered by constraints unique to the evaluation protocols of MLLM judges. We further propose the Manifold-Guided Semantic Induction Attack (MGSIA), a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges give affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at https://github.com/mala-lab/RobustMLLMJudge.
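
To make the two-term objective described above concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical linear "judge" whose probability of answering "Yes" is a sigmoid of a weighted sum, treats the input vector itself as the representation, and uses a hand-picked high-score center. The names `w`, `b`, `center`, `lam`, `eps`, and `step` are illustrative assumptions; the actual method operates on images and MLLM outputs, with centers estimated from proxy protocols.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_yes(x, w, b):
    # Hypothetical stand-in for the judge's probability of answering "Yes".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(x, w, b, center, lam):
    # Affirmative induction term plus alignment to an assumed high-score center.
    align = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return -math.log(p_yes(x, w, b)) + lam * align

def loss_grad(x, w, b, center, lam):
    # d/dx[-log sigmoid(z)] = -(1 - p) * w ; d/dx[lam * ||x - c||^2] = 2 * lam * (x - c)
    p = p_yes(x, w, b)
    return [-(1.0 - p) * wi + 2.0 * lam * (xi - ci)
            for wi, xi, ci in zip(w, x, center)]

def sign(v):
    return (v > 0) - (v < 0)

def attack(x0, w, b, center, lam=0.1, eps=0.1, step=0.02, iters=20):
    # Signed-gradient descent on the loss, projected onto an L-infinity
    # ball of radius eps around the clean input x0.
    x = list(x0)
    for _ in range(iters):
        g = loss_grad(x, w, b, center, lam)
        x = [xi - step * sign(gi) for xi, gi in zip(x, g)]
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x0)]
    return x

# Toy values, chosen only for illustration.
w, b = [1.0, -2.0, 0.5], -0.5
x0 = [0.2, 0.4, 0.1]
center = [0.6, 0.0, 0.5]
x_adv = attack(x0, w, b, center)
```

In this toy setting the perturbed input stays within `eps` of the original in every coordinate while the stand-in "Yes" probability rises. The weight `lam` trades off the induction term against the alignment term; how the two are actually balanced in MGSIA is not specified in the abstract.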
