Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on the expertise of a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations remain. We introduce LiveCodeBench Pro, a benchmark of problems from Codeforces, ICPC, and IOI that is continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, a regime where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. Strong performance appears to be driven largely by implementation precision and tool augmentation rather than superior reasoning. LiveCodeBench Pro thus highlights the significant gap that remains to human grandmaster-level performance, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
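For readers unfamiliar with the pass@1 metric cited above, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021), of which pass@1 is the k = 1 case. It is included only as a clarifying illustration of the metric; the function name and the sample numbers are illustrative and are not taken from the LiveCodeBench Pro evaluation code.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of sampled solutions per problem
    c: number of those samples that pass all tests
    k: budget of attempts being scored
    Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct solution.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical example: 10 samples on one problem, 4 correct.
# pass@1 = 1 - C(6,1)/C(10,1) = 0.4
print(pass_at_k(n=10, c=4, k=1))
```

Per-benchmark pass@1 is then the mean of this quantity over all problems, so the 53% and 0% figures above are averages across the medium- and hard-difficulty subsets, respectively.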