Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3,255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, no multi-agent coordination, and no specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains on our held-out dev set: Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out, human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
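To make the four-stage pipeline concrete, here is a minimal sketch of its control flow. The `Task` structure and every function name below are illustrative assumptions, not the authors' implementation; each stage is left as a stub.

```python
# Hedged sketch of the four-stage task-generation pipeline; the Task
# structure and all function names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Task:
    description: str   # natural-language task statement
    dockerfile: str    # containerized environment definition
    test_script: str   # completion test run inside the container

def generate_descriptions(n: int) -> list[str]:
    """Stage 1: procedurally generate diverse task descriptions
    (file operations, log management, data processing, ...)."""
    ...

def build_environment(desc: str) -> str | None:
    """Stage 2: build a containerized environment for the task and
    validate that it starts cleanly; return None on failure."""
    ...

def write_completion_test(desc: str, dockerfile: str) -> str:
    """Stage 3: produce a script that checks whether the task was completed."""
    ...

def is_solvable(task: Task) -> bool:
    """Stage 4: keep only tasks that a reference agent can solve."""
    ...

def build_tasks(n: int) -> list[Task]:
    tasks = []
    for desc in generate_descriptions(n):
        dockerfile = build_environment(desc)
        if dockerfile is None:
            continue                     # environment failed validation
        task = Task(desc, dockerfile, write_completion_test(desc, dockerfile))
        if is_solvable(task):            # solvability filter
            tasks.append(task)
    return tasks
```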
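Similarly, a minimal sketch of the interaction loop with a binary episode-level reward, as used for PPO training. Here `policy`, `run_in_container`, `run_completion_test`, and the `DONE` stop signal are hypothetical stand-ins, not the authors' API.

```python
# Hedged sketch of the minimal interaction loop; all helpers are
# hypothetical stand-ins, not the authors' API.

def run_in_container(task, command: str) -> str:
    """Execute `command` inside the task's container; return its output."""
    ...

def run_completion_test(task) -> bool:
    """Run the task's completion test inside the container."""
    ...

def rollout(policy, task, max_turns: int = 20):
    transcript = [{"role": "user", "content": task.description}]
    for _ in range(max_turns):
        command = policy(transcript)              # model proposes a shell command
        transcript.append({"role": "assistant", "content": command})
        if command.strip() == "DONE":             # assumed stop signal
            break
        output = run_in_container(task, command)  # execute in the task container
        transcript.append({"role": "user", "content": output})
    reward = 1.0 if run_completion_test(task) else 0.0  # binary episode-level reward
    return transcript, reward
```

The only learning signal is the scalar `reward` at the end of the episode; vanilla PPO optimizes the policy against it with no intermediate reward shaping.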