Closed-source agents suffer from several issues, such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents. LUMOS features a learnable, unified, and modular architecture: a planning module learns to generate high-level subgoals, and a grounding module is trained to translate these subgoals into actions that invoke various tools in the execution module. This design allows for modular upgrades and wider applicability to diverse interactive tasks. To foster generalizable agent learning, we collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks. On 9 datasets, LUMOS exhibits several key advantages: (1) LUMOS outperforms multiple larger open-source agents on the held-out datasets (unused for training) for each task type, and even surpasses GPT-based agents on QA and web tasks; (2) LUMOS outperforms open-source agents produced by chain-of-thought training and by unmodularized integrated training; and (3) LUMOS generalizes effectively to unseen tasks, outperforming 33B-scale agents and domain-specific agents.
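To make the planning–grounding–execution split above concrete, here is a minimal, hypothetical Python sketch of such a modular agent loop. All names (plan_subgoals, ground_to_actions, TOOLS, run_agent) and the stubbed behavior are illustrative assumptions for exposition, not the paper's actual interface or trained modules.

```python
# Hypothetical sketch of a LUMOS-style modular agent loop.
# In the real system, plan_subgoals and ground_to_actions would be
# learned LLM modules; here they are stubs to show the data flow.
from typing import Callable

# Execution module: a registry of tools the grounding module can invoke.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"<search results for {query!r}>",  # stub
    "calculator": lambda expr: str(eval(expr)),  # stub; unsafe eval, for brevity only
}

def plan_subgoals(task: str) -> list[str]:
    """Planning module: decomposes a task into high-level subgoals.
    Stubbed with a fixed two-step decomposition."""
    return [f"Find information relevant to: {task}", "Compute the final answer"]

def ground_to_actions(subgoal: str) -> list[tuple[str, str]]:
    """Grounding module: translates a subgoal into executable
    (tool_name, tool_input) actions. Stubbed with a trivial mapping."""
    if subgoal.startswith("Find"):
        return [("search", subgoal)]
    return [("calculator", "2 + 2")]  # placeholder action

def run_agent(task: str) -> list[str]:
    """Planning -> grounding -> execution, one subgoal at a time."""
    observations = []
    for subgoal in plan_subgoals(task):
        for tool_name, tool_input in ground_to_actions(subgoal):
            observations.append(TOOLS[tool_name](tool_input))
    return observations

print(run_agent("What is 2 + 2?"))
```

One appeal of this decomposition, as the abstract notes, is that each module can be upgraded independently: the tool registry can grow, or the grounding module can be retrained for a new action space, without retraining the planner.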