Writing competitive programming problems is exacting. Authors must set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate difficulty to lie beyond the reach of most competitors. We argue that this makes problem setting an ideal test of general large language model (LLM) capabilities and study whether LLMs can do it reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode's test suites reach nearly 99% consistency with official judgments, a significant improvement over current state-of-the-art methods such as HardTests, which achieve less than 81%. Furthermore, starting from a random seed problem, AutoCode can create novel variants together with reference and brute-force solutions. By cross-verifying these generated solutions against the test cases, we can filter out malformed problems. Human experts verify that the system maintains high correctness, and Grandmaster-level (top 0.3%) competitive programmers judge the novel problems AutoCode produces to be of contest quality.
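A minimal sketch of the cross-verification idea: a candidate problem is kept only if its reference and brute-force solutions agree on every generated test input. The command names, string-based I/O convention, and timeout below are illustrative assumptions, not AutoCode's actual interface.

```python
import subprocess

def outputs_agree(ref_cmd, brute_cmd, test_inputs, timeout=10):
    """Cross-verify a reference solution against a brute-force solution.

    ref_cmd / brute_cmd are hypothetical command lines (e.g. ["./ref"]) that
    read a test case from stdin and print the answer to stdout. The problem
    variant is discarded as malformed if the two solutions ever disagree.
    """
    for case in test_inputs:
        ref_out = subprocess.run(
            ref_cmd, input=case, capture_output=True, text=True, timeout=timeout
        ).stdout.strip()
        brute_out = subprocess.run(
            brute_cmd, input=case, capture_output=True, text=True, timeout=timeout
        ).stdout.strip()
        if ref_out != brute_out:
            return False  # disagreement: flag this problem variant as malformed
    return True
```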