Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios such as gaming or code generation. This paper introduces a novel training method for LLM-based agents that uses weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in an iterative manner: they first generate trajectories through environmental interaction; a critic LLM then selects a subset of good trajectories, which are used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-Bank dataset show consistent improvement in our agents' capabilities and performance comparable to GPT-4, despite using open-source models with far fewer parameters.
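The iterative loop described above (generate trajectories, have a critic select a good subset, update the agent on that subset) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `generate_trajectory`, `critic_score`, and `update_agent` are hypothetical stand-ins for environment interaction, the critic LLM's weak supervision signal, and supervised fine-tuning, respectively.

```python
import random

def generate_trajectory(agent, task):
    # Stand-in for the agent interacting with the environment step by step.
    return {"task": task, "steps": [f"action_v{agent['version']}_{i}" for i in range(3)]}

def critic_score(trajectory):
    # Stand-in for the critic LLM's judgment of trajectory quality;
    # a random score replaces the actual LLM call in this sketch.
    return random.random()

def update_agent(agent, good_trajectories):
    # Stand-in for supervised fine-tuning on the critic-selected subset.
    agent = dict(agent)
    agent["version"] += 1
    agent["train_size"] = len(good_trajectories)
    return agent

def train(agent, tasks, iterations=3, keep_ratio=0.5):
    """Iteratively improve the agent using critic-filtered trajectories."""
    for _ in range(iterations):
        trajectories = [generate_trajectory(agent, t) for t in tasks]
        # Keep the top fraction of trajectories as ranked by the critic.
        ranked = sorted(trajectories, key=critic_score, reverse=True)
        good = ranked[: max(1, int(len(ranked) * keep_ratio))]
        agent = update_agent(agent, good)
    return agent

agent = train({"version": 0}, tasks=["t1", "t2", "t3", "t4"])
print(agent["version"])  # 3 after three update iterations
```

The key design point, per the abstract, is that selection by the critic replaces both expert demonstrations and definitive environmental reward, so the loop only needs a relative quality judgment over the agent's own rollouts.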