Recent advancements in large language models (LLMs) have shown potential for human-like agents. To help these agents adapt to new tasks without extensive human supervision, we propose the Learning through Communication (LTC) paradigm, a novel training approach that enables LLM agents to improve continuously through interactions with their environments and other agents. Through iterative exploration and PPO training, LTC empowers the agent to assimilate short-term experiences into long-term memory. To optimize agent interactions for task-specific learning, we introduce three structured communication patterns, Monologue, Dialogue, and Analogue, tailored for common tasks such as decision-making, knowledge-intensive reasoning, and numerical reasoning. We evaluated LTC on three datasets: ALFWorld (decision-making), HotpotQA (knowledge-intensive reasoning), and GSM8k (numerical reasoning). On ALFWorld, LTC exceeds the instruction-tuning baseline by 12% in success rate. On HotpotQA, LTC surpasses the instruction-tuned LLaMA-7B agent by 5.1% in EM score and outperforms the instruction-tuned PaLM-62B agent, which is 9x larger, by 0.6%. On GSM8k, LTC outperforms the CoT-Tuning baseline by 3.6% in accuracy. These results showcase the versatility and efficiency of the LTC approach across diverse domains. We will open-source our code to promote further development in the community.