Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on the expertise of a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations remain. We introduce LiveCodeBench Pro, a benchmark of problems from Codeforces, ICPC, and IOI that is continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, a regime where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. Strong performance appears to be driven largely by implementation precision and tool augmentation rather than superior reasoning. LiveCodeBench Pro thus highlights the significant gap that remains to human grandmaster-level performance, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
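For readers unfamiliar with the pass@1 metric cited above, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021), of which pass@1 is the k = 1 case. It is included only as a clarifying illustration of the metric; the function name and the sample numbers are illustrative and are not taken from the LiveCodeBench Pro evaluation code.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of sampled solutions per problem
    c: number of those samples that pass all tests
    k: budget of attempts being scored
    Returns the probability that at least one of k samples
    drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct solution.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical example: 10 samples on one problem, 4 correct.
# pass@1 = 1 - C(6,1)/C(10,1) = 0.4
print(pass_at_k(n=10, c=4, k=1))
```

Per-benchmark pass@1 is then the mean of this quantity over all problems, so the 53% and 0% figures above are averages across the medium- and hard-difficulty subsets, respectively.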