We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the HumanEval benchmark ($\mathbf{90.9\%}$ vs. $\mathbf{90.2\%}$). In addition, AutoCoder offers a more versatile code interpreter than those of GPT-4 Turbo and GPT-4o: its code interpreter can install external packages rather than being limited to built-in ones. AutoCoder's training data is a multi-turn dialogue dataset created by a system combining agent interaction with external code execution verification, a method we term \textbf{\textsc{AIEV-Instruct}} (Instruction Tuning with Agent-Interaction and Execution-Verified). Compared to previous large-scale code dataset generation methods, \textsc{AIEV-Instruct} reduces dependence on proprietary large models and provides an execution-validated code dataset. The code and a demo video are available at \url{https://github.com/bin123apple/AutoCoder}.