Large language models (LLMs) have demonstrated remarkable language proficiency, but they still face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often unavailable for state-of-the-art LLMs such as GPT-4, or require diverse, high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and to ask the LLM to reflect on the pros and cons of the current plan based on the experience collected with it, update the plan accordingly, and then collect more experience with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while incurring lower inference cost.
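The collect, reflect, and update cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `optimize_plan`, `llm`, and `run_episode`, the prompt wording, and the iteration counts are all hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical type: one episode is (trajectory text, success flag).
Episode = Tuple[str, bool]


def optimize_plan(
    llm: Callable[[str], str],
    run_episode: Callable[[str], Episode],
    initial_plan: str,
    iterations: int = 3,
    episodes_per_iteration: int = 2,
) -> str:
    """Refine a text plan using only LLM calls (no gradient access).

    `llm` maps a prompt to a completion; `run_episode` attempts the
    task while following the given plan. Both are assumed interfaces.
    """
    plan = initial_plan
    for _ in range(iterations):
        # 1. Collect experience with the current plan.
        episodes: List[Episode] = [
            run_episode(plan) for _ in range(episodes_per_iteration)
        ]
        experience = "\n".join(
            f"[{'success' if ok else 'failure'}] {trajectory}"
            for trajectory, ok in episodes
        )
        # 2. Ask the LLM to reflect on the plan's pros and cons.
        reflection = llm(
            f"Plan:\n{plan}\n\nExperience:\n{experience}\n\n"
            "List the pros and cons of this plan."
        )
        # 3. Ask the LLM to update the plan based on the reflection.
        plan = llm(
            f"Plan:\n{plan}\n\nReflection:\n{reflection}\n\n"
            "Write an improved plan."
        )
    return plan
```

In this sketch the plan is the only state carried between iterations, so each prompt stays short regardless of how many episodes have been run; whether the actual method bounds its prompts this way is not stated in the abstract.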