Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text, large-scale decision-making data is difficult to collect. Moreover, many powerful LLMs are accessible only through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address these limitations, we propose a framework that automatically learns a reward model from the environment without human annotations. The reward model can be used to evaluate the action trajectories of LLM agents and to provide heuristics for task planning. Specifically, our approach employs one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. A separate LLM then assigns a task intent to each trajectory and synthesizes a negative response alongside the correct response. These triplets (task intent, positive response, and negative response) serve as training data for a reward model that scores action trajectories. Evaluations on different agent benchmarks demonstrate the effectiveness and generalizability of our framework. By automating the learning of reward models, the framework overcomes the scarcity of decision-making data and the fine-tuning restrictions of API-only LLMs, paving the way for LLM agents capable of tackling real-world problems that require multi-step decision-making in complex, interactive environments.
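To make the training step concrete, the sketch below shows one plausible way to optimize a reward model from (task intent, positive response, negative response) triplets: a Bradley-Terry-style pairwise ranking loss that pushes the score of the correct trajectory above that of the synthesized negative. The abstract does not specify the loss or architecture, so the linear scoring head, hidden dimension, and random features standing in for encoded triplets are all illustrative assumptions; in practice the encoder would be a pretrained language model over the concatenated intent and trajectory.

```python
# Minimal sketch of triplet-based reward-model training, assuming a
# Bradley-Terry pairwise ranking loss (not specified in the abstract).
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps an encoded (task intent, action trajectory) pair to a scalar score."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # A pretrained LM encoder would normally produce pooled_states;
        # a single linear head stands in for the scoring layer here.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_states: torch.Tensor) -> torch.Tensor:
        return self.scorer(pooled_states).squeeze(-1)


def pairwise_ranking_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_pos - r_neg): minimized when the positive trajectory
    # is scored higher than the synthesized negative one.
    return -torch.nn.functional.logsigmoid(r_pos - r_neg).mean()


model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Random features stand in for encoded (intent, trajectory) pairs.
pos_batch = torch.randn(8, 768)  # intent paired with the positive response
neg_batch = torch.randn(8, 768)  # intent paired with the negative response

optimizer.zero_grad()
loss = pairwise_ranking_loss(model(pos_batch), model(neg_batch))
loss.backward()
optimizer.step()
```

Once trained, such a model can score candidate action trajectories during planning, e.g., to rerank rollouts or guide search, which is the role the abstract assigns to the learned reward model.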