LLM agents equipped with a code interpreter can automatically solve real-world coding tasks such as data analysis and image editing. However, existing benchmarks focus either on simplistic tasks, such as completing a few lines of code, or on extremely complex and specific repository-level tasks; neither is representative of everyday coding work. To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks and covering more than ten file types. Given a high-level user query and related files, the LLM agent must reason about and execute Python code via a code interpreter over several turns before giving a formal response that fulfills the user's requirements. Successfully addressing PyBench tasks demands a robust understanding of diverse Python packages, strong reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations show that current open-source LLMs struggle with these tasks. We therefore conduct analyses and experiments on four kinds of datasets, demonstrating that comprehensive abilities are needed for PyBench. Our fine-tuned 8B model, \textbf{PyLlama3}, achieves strong performance on PyBench, surpassing many 33B and 70B models. Our benchmark, training dataset, and model are available at \href{https://github.com/Mercury7353/PyBench}{https://github.com/Mercury7353/PyBench}
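The multi-turn interaction pattern described above (query in, code actions executed with feedback, then a final response) can be sketched as a minimal agent loop. This is an illustrative sketch, not PyBench's actual evaluation harness: `propose_action` and `scripted_model` are hypothetical stand-ins for an LLM call, and the sandbox is a bare `exec` with stdout capture.

```python
import io
import contextlib


def run_python(code: str, env: dict) -> str:
    """Execute code in a shared namespace; capture stdout or the error as feedback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as exc:
        return f"Error: {exc!r}"
    return buf.getvalue()


def agent_loop(query: str, propose_action, max_turns: int = 5) -> str:
    """Multi-turn loop: the model alternates code actions with a final reply.

    `propose_action(query, history)` stands in for an LLM call; it returns
    ("code", <python source>) or ("final", <answer text>).
    """
    env: dict = {}                    # interpreter state persists across turns
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        kind, content = propose_action(query, history)
        if kind == "final":
            return content
        feedback = run_python(content, env)
        history.append((content, feedback))  # model sees execution feedback next turn
    return "Max turns reached without a final response."


# Scripted stand-in for the LLM: first compute, then answer using the feedback.
def scripted_model(query, history):
    if not history:
        return ("code", "print(sum(range(1, 11)))")
    return ("final", f"The sum of 1..10 is {history[-1][1].strip()}.")


print(agent_loop("What is the sum of 1..10?", scripted_model))
# → The sum of 1..10 is 55.
```

The key design point is that the execution namespace and feedback history carry over between turns, so the model can build on earlier results and recover from errors, which is the capability PyBench is designed to test.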