The effectiveness of large language models (LLMs) is closely tied to prompt design, making prompt optimization essential for enhancing their performance across a wide range of tasks. Many existing approaches to automating prompt engineering rely exclusively on textual feedback, refining prompts based solely on inference errors identified by large, computationally expensive LLMs. Smaller models, however, struggle to generate high-quality feedback, leaving these methods entirely dependent on the judgment of large LLMs. Moreover, because they operate purely in text space, these methods cannot exploit more direct and finer-grained signals such as gradients. To address this, we introduce GReaTer, a novel prompt optimization technique that directly incorporates gradient information over task-specific reasoning. By utilizing task-loss gradients, GReaTer enables self-optimization of prompts for open-source, lightweight language models without the need for costly closed-source LLMs. This allows high-performance prompt optimization without dependence on massive LLMs, closing the gap between smaller models and the sophisticated reasoning often needed for prompt refinement. Extensive evaluations across diverse reasoning tasks, including BBH, GSM8k, and FOLIO, demonstrate that GReaTer consistently outperforms previous state-of-the-art prompt optimization methods, even those that rely on powerful LLMs. Additionally, GReaTer-optimized prompts frequently exhibit better transferability and, in some cases, boost task performance to levels comparable to or surpassing those achieved by larger language models, highlighting the effectiveness of prompt optimization guided by gradients over reasoning. The code for GReaTer is available at https://github.com/psunlpgroup/GreaTer.
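To give a rough intuition for gradient-guided prompt editing, the toy sketch below shows a first-order (HotFlip-style) token substitution: given the gradient of a task loss with respect to one prompt token's embedding, it picks the vocabulary token that most decreases the loss under a linear approximation. This is only an illustrative assumption, not GReaTer's actual algorithm; the embedding table, dimensions, and function names here are invented for the example.

```python
import numpy as np

# Toy sketch of gradient-guided prompt token substitution
# (first-order HotFlip-style search; NOT the exact GReaTer method).
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
E = rng.normal(size=(vocab_size, dim))  # hypothetical token embedding table

def best_swap(grad):
    """Choose the vocab token v minimizing the first-order loss change:
        delta_loss ~= (E[v] - E[current]) . grad,
    which amounts to minimizing E[v] . grad over the vocabulary."""
    scores = E @ grad            # shape: (vocab_size,)
    return int(np.argmin(scores))

# Pretend the task-loss gradient w.r.t. one prompt token's embedding
# points along E[7]; the chosen swap then moves away from that direction.
grad = E[7].copy()
new_token = best_swap(grad)
```

A real implementation would backpropagate the task loss through the model's reasoning to obtain `grad`, and would restrict candidates to tokens that keep the prompt fluent.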