Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3,255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains on our held-out dev set: Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
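The minimal interaction loop with a binary episode-level reward can be sketched as follows. This is an illustrative sketch with assumed names (`run_episode`, the `task` dict with a `check` callable, and the scripted policy are all hypothetical); the paper's actual agent, sandbox, and test interfaces are not shown here. A local temporary directory stands in for the containerized environment.

```python
# Sketch of the agent-environment loop described above: the agent emits shell
# commands, the environment executes them, and a single binary reward is
# assigned at episode end based on the task's completion test.
# All names here are illustrative assumptions, not the paper's actual API.

import os
import subprocess
import tempfile

def run_episode(policy, task, max_turns=8):
    """Roll out one episode; return the transcript and a binary reward."""
    workdir = tempfile.mkdtemp()            # stand-in for a container sandbox
    transcript = []
    for _ in range(max_turns):
        command = policy(task, transcript)  # agent proposes the next command
        if command is None:                 # agent signals it is done
            break
        result = subprocess.run(
            command, shell=True, cwd=workdir,
            capture_output=True, text=True, timeout=30,
        )
        transcript.append((command, result.stdout + result.stderr))
    # Binary episode-level reward: 1 iff the completion test passes.
    passed = task["check"](workdir)
    return transcript, 1.0 if passed else 0.0

# Toy usage: a file-operations task with a scripted "policy".
task = {
    "prompt": "create hello.txt containing 'hi'",
    "check": lambda d: os.path.exists(os.path.join(d, "hello.txt")),
}
commands = iter(["echo hi > hello.txt", None])
transcript, reward = run_episode(lambda t, h: next(commands), task)
```

In training, the transcript would form the PPO trajectory and the scalar reward would be the only learning signal, matching the episode-level binary rewards described above.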