Although Large Language Models (LLMs) perform strongly in competitive programming, their role in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs but also their ability to identify errors in human-written code, a crucial educational activity traditionally supported by running test cases on online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models identify errors in no more than 50% of a set of submissions that UOJ users have found to be incorrect. While test-time scaling improves success rates to above 90%, the substantial computational cost of model inference limits its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.