Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3,255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
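To make the reward setup concrete, the binary episode-level reward described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the toy task, and the use of `subprocess` with shell commands are all assumptions. It shows the core idea that an episode's commands run against an isolated environment, a completion test runs afterward, and the agent receives 1.0 on pass and 0.0 otherwise.

```python
import subprocess
import tempfile

def run_episode(commands, test_script, workdir):
    """Illustrative sketch (not the paper's code): execute the agent's
    shell commands in an isolated working directory, then run the
    task's completion test. The reward is binary: 1.0 iff the test
    exits with status 0."""
    for cmd in commands:
        # Each agent action is a shell command executed in the task's
        # working directory; output is captured, not shown to us here.
        subprocess.run(cmd, shell=True, cwd=workdir,
                       capture_output=True, timeout=30)
    # Completion test: exit code 0 means the final state is correct.
    result = subprocess.run(test_script, shell=True, cwd=workdir,
                            capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0

# Hypothetical toy task: "create a file done.txt containing 'ok'".
with tempfile.TemporaryDirectory() as d:
    reward = run_episode(
        commands=["echo ok > done.txt"],
        test_script="grep -qx ok done.txt",
        workdir=d,
    )
```

In the actual pipeline the environment would be a container rather than a temporary directory, but the reward logic, a single pass/fail signal at episode end, is the point of the sketch.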