Large Language Models (LLMs) have become integral components of various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO, a learning method designed to enhance the performance of open LLM agents. In contrast to previous studies that train exclusively on successful expert trajectories, our method allows agents to learn from their exploration failures, yielding improved performance through an iterative optimization framework. During the exploration phase, the agent interacts with the environment while completing given tasks, gathering failure trajectories that are paired with successful expert trajectories to form contrastive trajectory pairs. In the subsequent training phase, the agent uses these trajectory preference pairs to update its policy with contrastive learning methods such as DPO (Direct Preference Optimization). This iterative cycle of exploration and training fosters continued improvement in the agent. Our experiments on three complex tasks demonstrate that ETO consistently surpasses the baselines by a large margin. Furthermore, analyses of task-solving efficiency and of the method's potential in scenarios lacking expert trajectories underscore the effectiveness of our approach.
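To make the training phase concrete, the sketch below shows a DPO-style objective over one batch of contrastive trajectory pairs, assuming each trajectory is scored by its total action log-probability under the current policy and a frozen reference model. The function name, argument layout, and the default `beta` value are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a DPO loss over contrastive trajectory pairs (assumed
# interface, not the authors' code). "success" is the preferred expert
# trajectory; "failure" is the agent's failed exploration trajectory.
import torch
import torch.nn.functional as F

def dpo_trajectory_loss(policy_logp_success: torch.Tensor,
                        policy_logp_failure: torch.Tensor,
                        ref_logp_success: torch.Tensor,
                        ref_logp_failure: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds total trajectory log-probabilities (sums of
    per-action log-probs) for a batch of pairs, under either the
    trainable policy or the frozen reference model."""
    # Implicit reward margins of each trajectory relative to the reference.
    margin_success = policy_logp_success - ref_logp_success
    margin_failure = policy_logp_failure - ref_logp_failure
    # Maximize the log-sigmoid of the scaled reward gap, pushing the policy
    # to prefer the successful trajectory over the failed one.
    return -F.logsigmoid(beta * (margin_success - margin_failure)).mean()
```

In the full iterative loop the abstract describes, a loss of this form would be minimized over all pairs collected in an exploration phase, and the updated policy would then drive the next round of exploration.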