Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code reasoning under perturbation remains underexplored. We introduce CodeCrash, a stress-testing framework of 1,279 questions drawn from CruxEval and LiveCodeBench, designed to evaluate reasoning reliability under structural perturbations and misleading natural language (NL) contexts. Through a systematic evaluation of 17 LLMs, we find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% on output prediction tasks. Even with Chain-of-Thought reasoning, models still suffer an average 13.8% drop due to distraction and rationalization, revealing a lack of the critical reasoning needed to distinguish actual code behavior from misleading cues. While Large Reasoning Models with internal reasoning mechanisms improve robustness by fostering critical thinking, plausible yet incorrect hints can trigger pathological self-reflection, inflating token consumption by two to three times and, in extreme cases, inducing catastrophic cognitive dissonance in QwQ-32B. We refer to this phenomenon as Reasoning Collapse. CodeCrash provides a rigorous benchmark for evaluating robustness in code reasoning, guiding future research and development toward more reliable and resilient models.