Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations in environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as $\tau^2$-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between the number of training domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.