Large language models (LLMs) are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break the problem down into smaller reasoning steps (Wei et al., 2022), or ensembling multiple generations by modifying the decoding steps (Wang et al., 2023), boosts performance. Current methods assume that the input prompt is fixed and rely on the decoding strategy to introduce the diversity needed for ensembling. In this work, we relax this assumption and discuss how one can create and leverage variations of the input prompt as a means of introducing diversity of thought to improve model performance. We propose a method that automatically improves prompt diversity by soliciting feedback from the LLM to ideate approaches that fit the problem. We then ensemble the diverse prompts across multiple inference calls in our method DIV-SE (DIVerse reasoning path Self-Ensemble). We also propose a cost-effective alternative in which the diverse prompts are used within a single inference call; we call this IDIV-SE (In-call DIVerse reasoning path Self-Ensemble). Under a fixed generation budget, DIV-SE and IDIV-SE outperform the previously discussed baselines using both GPT-3.5 and GPT-4 on several reasoning benchmarks, without modifying the decoding process. Additionally, DIV-SE advances state-of-the-art performance on recent planning benchmarks (Valmeekam et al., 2023), exceeding the highest previously reported accuracy by at least 29.6 percentage points on the most challenging 4/5 Blocksworld task. Our results shed light on how to enforce prompt diversity for LLM reasoning and thereby improve the Pareto frontier of the accuracy-cost trade-off.
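To make the difference between DIV-SE and IDIV-SE concrete, here is a minimal Python sketch of the two ensembling patterns. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names `div_se` and `idiv_se`, the `ask` callable standing in for an LLM call, and the use of a plain majority vote to aggregate answers are all hypothetical choices, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    return Counter(answers).most_common(1)[0][0]


def div_se(question: str, approaches: Sequence[str],
           ask: Callable[[str], str]) -> str:
    """DIV-SE-style sketch: one inference call per approach, then aggregate.

    `ask` is a hypothetical stand-in for an LLM call (prompt in, answer out).
    Majority voting is an assumed aggregation rule.
    """
    answers = []
    for approach in approaches:
        # Hypothetical prompt template; the real prompts are not given here.
        prompt = (f"Solve the problem using this approach: {approach}\n"
                  f"Problem: {question}\nAnswer:")
        answers.append(ask(prompt).strip())
    return majority_vote(answers)


def idiv_se(question: str, approaches: Sequence[str],
            ask: Callable[[str], str]) -> str:
    """IDIV-SE-style sketch: all approaches in a single inference call.

    Assumes the model replies with one answer per line, one per approach.
    """
    listing = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(approaches))
    prompt = ("Solve the problem once with each approach below and give "
              f"one answer per line.\n{listing}\nProblem: {question}")
    reply = ask(prompt)
    answers = [line.strip() for line in reply.splitlines() if line.strip()]
    return majority_vote(answers)
```

In this sketch, `div_se` costs one call per approach, while `idiv_se` costs a single call, which mirrors the accuracy-cost trade-off the abstract describes. Any real use would replace `ask` with an actual model client and a more robust answer parser.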