Large language models (LLMs) show promise for solving complex practical problems when action-based policies are combined with chain-of-thought (CoT) reasoning. The effectiveness of this combination, however, depends on high-quality prompts. Currently, these prompts are handcrafted with extensive human labour, and the resulting CoT policies frequently fail to generalise. Human intervention is also required to develop grounding functions that ensure low-level controllers process CoT reasoning appropriately. In this paper, we propose a comprehensive training framework for complex task-solving that incorporates human prior knowledge into the learning of action policies. To this end, we introduce a new leader-follower bilevel framework that learns to ask relevant questions (prompts) and then performs reasoning to guide the learning of actions. The prompt policy makes introspective revisions based on historical findings, steering the CoT process to consider the anticipated goals and to generate outputs that lead to decisive, high-performing actions. The action policy then learns to comprehend and integrate the CoT outputs in order to take actions. Our empirical results show that the framework outperforms leading methods on five decision-making tasks, including Overcooked and FourRoom.
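To make the leader-follower structure concrete, the following is a minimal toy sketch of the loop described in the abstract: a prompt policy (leader) picks a question, a CoT step produces a thought, and an action policy (follower) acts on that thought, with both policies updated from the same environment reward. Everything here is an illustrative assumption rather than the paper's method: the prompts, the two-action environment, the `stub_cot` function (which stands in for an LLM and calls no model), and the epsilon-greedy bandit learners used in place of whatever training algorithms the framework actually employs.

```python
import random
from collections import defaultdict

# Hypothetical candidate questions the leader (prompt policy) can ask.
PROMPTS = ["What is the current goal?", "What colour is the floor?"]
# Hypothetical low-level actions available to the follower (action policy).
ACTIONS = ["left", "right"]


def stub_cot(prompt, goal):
    """Stand-in for an LLM's chain-of-thought output; no model is called."""
    if prompt == PROMPTS[0]:
        return f"the goal is on the {goal}"  # informative thought
    return "the floor is grey"  # uninformative thought


class BanditPolicy:
    """Epsilon-greedy learner with sample-average value estimates."""

    def __init__(self, choices, rng, epsilon=0.2):
        self.choices = choices
        self.rng = rng
        self.epsilon = epsilon
        self.q = defaultdict(float)  # value estimate per (context, choice)
        self.n = defaultdict(int)    # visit count per (context, choice)

    def greedy(self, context):
        # Ties are broken in favour of the earliest entry in `choices`.
        return max(self.choices, key=lambda c: self.q[(context, c)])

    def act(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.choices)  # explore
        return self.greedy(context)               # exploit

    def update(self, context, choice, reward):
        key = (context, choice)
        self.n[key] += 1
        self.q[key] += (reward - self.q[key]) / self.n[key]


def train(episodes=5000, seed=0):
    rng = random.Random(seed)
    prompt_policy = BanditPolicy(PROMPTS, rng)  # leader
    action_policy = BanditPolicy(ACTIONS, rng)  # follower
    for _ in range(episodes):
        goal = rng.choice(ACTIONS)               # hidden goal for this episode
        prompt = prompt_policy.act("start")      # leader asks a question
        thought = stub_cot(prompt, goal)         # CoT reasoning on the prompt
        action = action_policy.act(thought)      # follower acts on the thought
        reward = 1.0 if action == goal else 0.0
        action_policy.update(thought, action, reward)
        # The leader is credited with the reward its question made possible.
        prompt_policy.update("start", prompt, reward)
    return prompt_policy, action_policy


if __name__ == "__main__":
    leader, follower = train()
    print("preferred prompt:", leader.greedy("start"))
```

In this toy setting only the first question makes the thought depend on the hidden goal, so the leader's value estimate for it rises once the follower has learned to act on the resulting thoughts. That coupling is the property the bilevel formulation is built around.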