Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial before writing code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck: specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations, and we validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, evaluate 19 LLMs, and argue that future benchmarks should explicitly separate problem solving from implementation.