Agentic coding requires agents to interact effectively with runtime environments, e.g., command line interfaces (CLIs), to complete tasks such as resolving dependency issues and fixing system problems. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, drawing an analogy between the Dockerfile and the agentic task, we propose employing agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one exhibiting runtime failures, from which a task can be derived by packing the buggy state together with the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, constituting the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for the scalable derivation of environment-intensive tasks.
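The Dockerfile/task analogy above can be illustrated with a minimal sketch: a Dockerfile's build steps form an ordered environment history, truncating that history "inverts" the final healthy state to an earlier one, and a task packs the earlier state with an observed error message. All names here (`invert_to`, `derive_task`, the sample Dockerfile, and the error string) are hypothetical illustrations, not the paper's actual pipeline or API; no container build is performed.

```python
# Illustrative-only sketch of the history-inversion idea.
DOCKERFILE = """\
FROM python:3.11-slim
RUN pip install requests
RUN pip install mypkg-cli
RUN mypkg-cli --init
"""

def history(dockerfile: str) -> list[str]:
    """Return the ordered build steps, i.e., the environment history."""
    return [ln for ln in dockerfile.splitlines() if ln and not ln.startswith("#")]

def invert_to(dockerfile: str, step: int) -> str:
    """Truncate the history after `step` lines: an earlier, possibly broken state."""
    return "\n".join(history(dockerfile)[:step])

def derive_task(dockerfile: str, step: int, error_msg: str) -> dict:
    """Pack the buggy earlier state and its error message into a task."""
    return {
        "start_state": invert_to(dockerfile, step),
        "error": error_msg,
        "goal": "repair the environment so the final command succeeds",
    }

# With only the first two steps applied, mypkg-cli is never installed,
# so the final command would fail with a (hypothetical) error:
task = derive_task(DOCKERFILE, step=2,
                   error_msg="bash: mypkg-cli: command not found")
```

In the actual pipeline, execution feedback from rebuilding and probing the truncated environment would supply the error message; here it is hard-coded purely for illustration.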