ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu

Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle to learn stably over long-horizon, multi-turn interactions. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Building on these two components, we develop a unified training methodology that integrates SFT with online RL, using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
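
To make the idea of a rule-verifiable environment with a trajectory-level reward more concrete, the following is a minimal sketch and not the ASTRA implementation. The abstract does not specify the reward formula, so everything here is an assumption made for illustration: the `ToolCall` and `Trajectory` types, the exact-match verifier, the `max_turns` budget, and the `efficiency_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToolCall:
    """One tool invocation in a trajectory (hypothetical structure)."""
    name: str
    args: dict


@dataclass
class Trajectory:
    """A multi-turn rollout: the tool calls made and the final answer."""
    calls: List[ToolCall] = field(default_factory=list)
    final_answer: str = ""


def rule_verifier(expected: str) -> Callable[[Trajectory], bool]:
    """Build a deterministic rule check: exact match on the final answer.

    Assumption: a real environment could use richer rules; exact match is
    only the simplest deterministic example.
    """
    def verify(traj: Trajectory) -> bool:
        return traj.final_answer.strip() == expected
    return verify


def trajectory_reward(
    traj: Trajectory,
    verify: Callable[[Trajectory], bool],
    max_turns: int = 8,
    efficiency_weight: float = 0.2,
) -> float:
    """Score a whole trajectory, trading off completion and efficiency.

    Assumed form: a failed task scores 0; a completed task scores between
    (1 - efficiency_weight) and 1, with fewer tool calls scoring higher.
    """
    if not verify(traj):
        return 0.0
    turns_used = min(len(traj.calls), max_turns)
    efficiency = 1.0 - turns_used / max_turns
    return (1.0 - efficiency_weight) + efficiency_weight * efficiency


if __name__ == "__main__":
    verify = rule_verifier("42")
    short = Trajectory(
        calls=[ToolCall("search", {"q": "x"}), ToolCall("calc", {"expr": "6*7"})],
        final_answer="42",
    )
    print(trajectory_reward(short, verify))
```

Because the reward is computed once per trajectory from a deterministic check, the same rollout always receives the same score. That determinism is the property the abstract attributes to code-executable, rule-verifiable environments.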
