The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code-Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. We evaluate OpenCodeInterpreter comprehensively on key benchmarks, namely HumanEval, MBPP, and their enhanced versions from EvalPlus, and find that it performs strongly. Notably, OpenCodeInterpreter-33B achieves an average accuracy of 83.2 on HumanEval and MBPP (76.4 on their plus versions), closely rivaling GPT-4's 84.2 (76.2), and this rises further to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter narrows the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
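The generate, execute, and refine cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `run_code`, `refine`, and `stub_generate` are hypothetical names introduced here, and `stub_generate` is a hard-coded stand-in for a model call so that the example runs without any model. The sketch only assumes that each candidate program is executed and that its output or error message is passed back as feedback for the next attempt.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Tuple

# One past attempt: (candidate code, execution output or error text).
Turn = Tuple[str, str]


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `code` in a fresh Python subprocess; return (succeeded, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine(
    task: str,
    generate: Callable[[str, List[Turn]], str],
    max_turns: int = 3,
) -> Tuple[Optional[str], List[Turn]]:
    """Ask `generate` for code, execute it, and feed failures back.

    Returns the first candidate that exits cleanly (or None) together
    with the full history of attempts and their execution feedback.
    """
    history: List[Turn] = []
    for _ in range(max_turns):
        code = generate(task, history)
        ok, output = run_code(code)
        history.append((code, output))
        if ok:
            return code, history
    return None, history


def stub_generate(task: str, history: List[Turn]) -> str:
    """Hypothetical stand-in for a model: fails once, then self-corrects."""
    if not history:
        return "print(undefined_name)"  # first attempt raises NameError
    return "print(sum(range(5)))"  # revised attempt after seeing feedback
```

Calling `refine("sum the integers 0 through 4", stub_generate)` returns the second candidate along with a two-entry history whose first entry carries the `NameError` traceback. In a real system the `generate` callable would wrap a model, and human feedback could be appended to the history alongside the execution output.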