As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluations becomes central. We propose Gistify, a task in which a coding LLM must create a single, minimal, self-contained file that reproduces a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
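To make the task criterion concrete, the following is a minimal sketch of the kind of check Gistify implies: run the given entrypoint under the full codebase, run the generated single file on its own, and compare outputs. The function name, the use of stdout equality, and the shell-based invocation are illustrative assumptions, not the paper's actual evaluation harness.

```python
import subprocess

def outputs_match(repo_dir: str, entrypoint: str, gist_file: str) -> bool:
    """Illustrative check (assumed, not the paper's harness): does the
    gistified file reproduce the entrypoint's output from the full codebase?"""
    # Run the entrypoint (e.g., "python -m pkg.tool --flag") inside the full repo.
    full = subprocess.run(entrypoint, shell=True, cwd=repo_dir,
                          capture_output=True, text=True)
    # Run the single self-contained file with no access to the rest of the repo.
    gist = subprocess.run(["python", gist_file],
                          capture_output=True, text=True)
    # Here success is taken to mean identical stdout; real scoring may differ.
    return full.stdout == gist.stdout
```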