Standard accuracy benchmarks measure how often large language models (LLMs) answer correctly, but they are not suited to testing whether a model keeps a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge its answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. Self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that increases flip rates by up to 23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
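To make the flip-rate metric concrete, the following is a minimal sketch of how it could be computed from challenge records. It is not the released implementation: the `ChallengeRecord` fields and both function names are hypothetical. The pooled variant assumes that "most effective" means an argument that flips the target model, which makes it an oracle-style upper bound rather than the paper's exact selection rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenge to a target model (hypothetical schema, for illustration)."""
    question_id: str
    source_model: str        # model that wrote the wrong-answer argument
    initially_correct: bool  # target model answered correctly before the challenge
    flipped: bool            # target model abandoned its answer after the challenge


def flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Percentage of initially correct answers that flip under challenge."""
    eligible = [r for r in records if r.initially_correct]
    if not eligible:
        raise ValueError("no initially correct answers to challenge")
    return 100.0 * sum(r.flipped for r in eligible) / len(eligible)


def pooled_flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Flip rate when the best argument per question is chosen across sources.

    Assumption: a question counts as flipped if at least one source model's
    argument flips the target model on that question.
    """
    flipped_by_question: dict[str, bool] = {}
    for r in records:
        if r.initially_correct:
            previous = flipped_by_question.get(r.question_id, False)
            flipped_by_question[r.question_id] = previous or r.flipped
    if not flipped_by_question:
        raise ValueError("no initially correct answers to challenge")
    return 100.0 * sum(flipped_by_question.values()) / len(flipped_by_question)
```

Under these assumptions, a difference between two conditions (for example, self-attributed versus unattributed arguments) is reported in percentage points by subtracting the two `flip_rate` values. Questions the target model answered incorrectly are excluded from the denominator, which matches the protocol's focus on answers that were correct before the challenge.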