Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized text attacks that affect many models equally, leaving open the question of whether it is possible to selectively degrade (or enhance) a target model's performance while leaving other models largely unaffected. We formalize this problem and study selective adversarial attacks on MMLU, a widely used benchmark designed to measure a language model's broad general knowledge and reasoning ability across subjects. Using canonical attacks implemented in the TextAttack framework, we introduce a protocol for assessing selectivity, develop a custom constraint that increases attack selectivity, and propose a surrogate-LLM pipeline that generates selective perturbations. Empirically, we find that selective adversarial attacks exist and can materially alter relative rankings, challenging the fairness, reproducibility, and transparency of leaderboard-driven evaluation. Our results motivate perturbation-aware reporting and robustness diagnostics for LLM evaluation and demonstrate that even subtle edits can shift comparative judgments.
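As a concrete illustration of the kind of custom constraint the abstract mentions, the sketch below shows one possible way to enforce selectivity inside TextAttack: a candidate perturbation is admitted only if every held-out, "protected" model still gives the same answer it gave on the unperturbed question. This is a minimal sketch under assumptions, not the authors' implementation; the `ProtectedModelsUnchanged` class and the `protected_models` callables are hypothetical names introduced here for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of a selectivity-oriented
# TextAttack constraint: keep a candidate perturbation only if no protected
# (non-target) model changes its prediction relative to the reference text.
from textattack.constraints import Constraint


class ProtectedModelsUnchanged(Constraint):
    """Reject perturbations that flip any protected model's answer."""

    def __init__(self, protected_models, compare_against_original=True):
        super().__init__(compare_against_original)
        # Each entry is assumed to be a callable: text -> predicted answer label.
        self.protected_models = protected_models

    def _check_constraint(self, transformed_text, reference_text):
        # Admit the candidate only if every protected model's prediction
        # on the perturbed text matches its prediction on the reference text.
        for predict in self.protected_models:
            if predict(transformed_text.text) != predict(reference_text.text):
                return False
        return True
```

In TextAttack, such a constraint would be appended to the constraint list of an existing attack recipe alongside its stock semantic-similarity constraints, so that the search only explores perturbations that spare the protected models.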