ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu

Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
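To make the idea of a trajectory-level reward over a rule-verifiable environment concrete, the following is a minimal sketch and not the released ASTRA implementation. The abstract does not state the reward formula, so everything here is an assumption for illustration: the `Trajectory` type, the exact-match rule in `rule_verify`, the `max_calls` budget, and the `efficiency_weight` value are all hypothetical. It only shows one plausible way a single scalar per episode could trade off task completion against interaction efficiency.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """One multi-turn episode: the tool calls issued and the final answer."""
    tool_calls: List[str] = field(default_factory=list)
    final_answer: str = ""


def rule_verify(final_answer: str, expected: str) -> bool:
    """Deterministic check. Normalized exact match is an assumed rule."""
    return final_answer.strip().lower() == expected.strip().lower()


def trajectory_reward(
    traj: Trajectory,
    expected: str,
    max_calls: int = 8,              # hypothetical tool-call budget
    efficiency_weight: float = 0.2,  # hypothetical penalty strength
) -> float:
    """Return one scalar for the whole trajectory.

    A failed verification yields 0. A verified answer yields 1 minus a
    penalty that grows with the fraction of the call budget consumed.
    """
    if not rule_verify(traj.final_answer, expected):
        return 0.0
    used_fraction = min(len(traj.tool_calls), max_calls) / max_calls
    return 1.0 - efficiency_weight * used_fraction


short = Trajectory(tool_calls=["search", "lookup"], final_answer="Paris")
long = Trajectory(tool_calls=["search"] * 8, final_answer="paris ")
wrong = Trajectory(tool_calls=["search"], final_answer="Lyon")

reward_short = trajectory_reward(short, "Paris")
reward_long = trajectory_reward(long, "Paris")
reward_wrong = trajectory_reward(wrong, "Paris")
```

Under these assumptions, a correct answer reached with fewer tool calls scores higher than a correct answer that exhausts the budget, and an unverified answer scores zero regardless of length. Because the check is rule-based rather than model-judged, the same trajectory always receives the same reward.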
