Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they produce contradictory evaluation outcomes, with different attack strategies tending to favor different models, and, more critically, they operate solely through external perturbations, failing to capture the intrinsic stability essential for autonomous coding agents, where subsequent inputs are endogenously generated by the model itself. We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization). EVALOOOP establishes a self-contained feedback loop in which an LLM iteratively transforms between code and natural language until functional failure occurs, with robustness quantified by a novel Average Sustainable Loops (ASL) metric: the mean number of iterations maintaining functional correctness across benchmark tasks. This cyclical strategy intrinsically evaluates robustness without relying on external attack configurations, providing a unified metric that reveals how effectively LLMs preserve semantic integrity through sustained self-referential transformations. We evaluate 96 popular LLMs, ranging from 0.5B to 685B parameters, on EVALOOOP equipped with the MBPP Plus benchmark, and find that EVALOOOP typically induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., a one-time query); for instance, Qwen3-235B-A22B-Instruct-2507, despite inferior initial code generation compared to OpenAI's o-series models and DeepSeek-V3, demonstrates superior robustness (a higher ASL score).
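The feedback loop and the ASL metric described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_code`, `summarize_code`, and `passes_tests` are hypothetical stand-ins for the model's NL-to-code call, its code-to-NL call, and the benchmark's functional-correctness oracle.

```python
def sustained_loops(task, generate_code, summarize_code, passes_tests, max_loops=10):
    """Count consecutive code<->NL iterations that stay functionally correct."""
    spec = task  # start from the natural-language task description
    loops = 0
    for _ in range(max_loops):
        code = generate_code(spec)   # NL -> code
        if not passes_tests(code):   # functional failure ends the loop
            break
        loops += 1
        spec = summarize_code(code)  # code -> NL, fed back as the next input

    return loops


def average_sustainable_loops(tasks, generate_code, summarize_code, passes_tests,
                              max_loops=10):
    """ASL: mean number of correctness-preserving loops across benchmark tasks."""
    counts = [sustained_loops(t, generate_code, summarize_code, passes_tests,
                              max_loops=max_loops) for t in tasks]
    return sum(counts) / len(counts)
```

A model that keeps regenerating passing code for every task up to the loop cap attains the maximum ASL; a model whose self-generated summaries drift until the regenerated code fails scores lower, regardless of its one-shot pass@1.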