Large Language Models (LLMs) have been shown to struggle with long-term planning, which may stem from the limited way in which they explore the space of possible solutions. We propose an architecture in which a Reinforcement Learning (RL) agent guides an LLM's exploration of the solution space: (1) the agent has access to domain-specific information and can therefore assess the quality of candidate solutions based on specific, relevant metrics that were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating the immediate next steps, without the need for long-term planning. We enable non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, namely binary classification, and the intermediate reasoning steps. Our approach compares favorably against CoT and ToT.