Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recent work trains language agents to improve their performance, using multi-step reasoning and action trajectories as training data. However, collecting such trajectories still requires considerable human effort, either through manual annotation or through implementing diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. Its central component is an ActRe prompting agent, which explains the reason for an arbitrary action. When the ReAct-style agent randomly samples an external action, it can query the ActRe agent with that action to obtain a textual rationale. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for each failed task and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, contrastive self-training on the accumulated trajectories forms a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-source Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and reaches 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches average human performance, and 4 rounds of iterative refinement bring it close to that of human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.
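As an illustration only, the annotation step and the training signal described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the names `synthesize_step`, `format_react`, `binarize`, and `policy_gradient_loss` are hypothetical, the `actre` callable stands in for a prompted language model, and the reward values of +1 for success and -1 for failure are assumed, since the abstract only says the rewards are binarized.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A synthesized step is a (rationale, action) pair in ReAct style.
Step = Tuple[str, str]


def synthesize_step(
    candidate_actions: Sequence[str],
    actre: Callable[[str], str],
    rng: random.Random,
) -> Step:
    """Sample an external action at random, then ask the ActRe stand-in
    for a posterior rationale explaining that action."""
    action = rng.choice(candidate_actions)
    rationale = actre(action)  # hypothetical stand-in for an LLM query
    return (rationale, action)


def format_react(step: Step) -> str:
    """Prepend the rationale to the action, giving a ReAct-style step."""
    rationale, action = step
    return f"Thought: {rationale}\nAction: {action}"


def binarize(success: bool) -> float:
    """Assumed reward scheme: +1 for a successful trajectory, -1 for a failed one."""
    return 1.0 if success else -1.0


def policy_gradient_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style objective over whole trajectories:
    loss = -mean(R_i * log p(trajectory_i))."""
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    weighted = [r * lp for r, lp in zip(rewards, log_probs)]
    return -sum(weighted) / len(weighted)


if __name__ == "__main__":
    rng = random.Random(0)
    actions = ["go to desk 1", "open drawer 2", "take mug 1"]
    # Stub rationale generator; a real system would prompt a language model.
    stub_actre = lambda a: f"I choose to {a} because it may advance the task."
    print(format_react(synthesize_step(actions, stub_actre, rng)))

    # One successful and one failed trajectory for the same task.
    rewards = [binarize(True), binarize(False)]
    print(policy_gradient_loss([-1.0, -2.0], rewards))
```

Under the assumed reward scheme, minimizing this loss raises the likelihood of the successful trajectory and lowers that of the failed one, which is one way to read the contrastive aspect of the self-training.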