Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find that language agents consistently improve after their backbone LMs are fine-tuned. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show that more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency, and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents and provides an initial set of experimental designs, insights, and open questions toward language agent fine-tuning.