Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap with a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports both seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset of terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, and 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B from 4.0% to 20.2%, and Nemotron-Terminal-32B from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.