Current coding benchmarks often inflate the apparent capabilities of Large Language Models (LLMs) due to static paradigms and data contamination, allowing models to exploit statistical shortcuts rather than engage in genuine reasoning. To address this, we introduce UniCode, a generative evaluation framework that systematically probes the limits of LLMs via: (1) multi-dimensional augmentation that transforms seed problems into complex variants to disrupt fixed algorithmic patterns; (2) a highly reliable, automated test-generation pipeline for scalable evaluation; and (3) fine-grained metrics that yield rich error signals. Experiments reveal a 31.2% performance collapse of state-of-the-art models on UniCode, driven primarily by deficiencies in conceptual modeling and scalability reasoning rather than by syntactic errors. Furthermore, we uncover a seed-problem regression, in which models revert to memorized seed logic instead of following the new specification, signaling a reliance on shortcuts over reasoning. This work validates UniCode as a robust framework for exposing model fragility and fostering reasoning-oriented code intelligence.
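To make the multi-dimensional augmentation idea concrete, the following is a minimal Python sketch of composing per-dimension transformations over a seed problem. All names here (`Problem`, `scale_constraints`, `invert_objective`, `augment`) are hypothetical illustrations of the concept, not the actual UniCode implementation.

```python
# Hypothetical sketch: a seed problem passes through composable transformations
# (constraint scaling, objective inversion) to produce variants on which a
# memorized seed solution would fail. Illustrative only; not UniCode's code.
from dataclasses import dataclass, replace
from typing import Callable

@dataclass(frozen=True)
class Problem:
    statement: str
    max_n: int          # input-size constraint
    objective: str      # e.g. "maximize" or "minimize"

def scale_constraints(p: Problem) -> Problem:
    # Raise input bounds so that, e.g., an O(n^2) seed solution no longer passes.
    new_n = p.max_n * 1000
    return replace(p, max_n=new_n,
                   statement=p.statement + f" Now 1 <= n <= {new_n}.")

def invert_objective(p: Problem) -> Problem:
    # Flip the optimization direction to break pattern-matched seed logic.
    flipped = "minimize" if p.objective == "maximize" else "maximize"
    return replace(p, objective=flipped,
                   statement=p.statement.replace(p.objective, flipped))

def augment(seed: Problem,
            transforms: list[Callable[[Problem], Problem]]) -> Problem:
    # Compose transformations along several dimensions at once.
    for t in transforms:
        seed = t(seed)
    return seed

seed = Problem("Given an array of n integers, maximize the subarray sum.",
               max_n=1_000, objective="maximize")
variant = augment(seed, [scale_constraints, invert_objective])
print(variant.statement)
```

Composing several such transformations at once is what makes a variant hard to answer from the memorized seed: a model that regresses to the seed logic violates the scaled constraints, the inverted objective, or both.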