In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective in such optimization, query dependency, and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. First, there is no effective method to evaluate prompts during inference, when the golden answer is unavailable. Second, learning via interactions with the LLMs to navigate the expansive natural language prompting space is resource-intensive. To address these challenges, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exist as by-products when diverse prompts are benchmarked on openly accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model, which can evaluate any query-prompt pair without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.
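The two-step recipe described in the abstract (fit a reward model on offline demonstrations, then pick the best of N candidate prompts per query without calling the LLM) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the reward model, so a smoothed success-rate table keyed on a hypothetical `query_bucket` feature stands in for the learned model, and the demonstration records and candidate prompts are invented toy data.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# One offline demonstration: (query, prompt, was the LLM's answer correct?).
# In the setting described above, such records are by-products of benchmarking.
Demo = Tuple[str, str, bool]


def query_bucket(query: str) -> str:
    """Hypothetical query feature: 'long' if more than 12 words, else 'short'."""
    return "long" if len(query.split()) > 12 else "short"


def fit_offline_reward(demos: Sequence[Demo]) -> Callable[[str, str], float]:
    """Toy stand-in for the offline reward model (an assumption, not the paper's).

    It estimates the success rate of each prompt per query bucket from the
    logged demonstrations, so scoring a pair needs no LLM call.
    """
    counts = defaultdict(lambda: [0, 0])  # (bucket, prompt) -> [successes, trials]
    for query, prompt, correct in demos:
        entry = counts[(query_bucket(query), prompt)]
        entry[0] += int(correct)
        entry[1] += 1

    def reward(query: str, prompt: str) -> float:
        successes, trials = counts.get((query_bucket(query), prompt), (0, 0))
        return (successes + 1) / (trials + 2)  # Laplace-smoothed success rate

    return reward


def best_of_n(query: str, prompts: Sequence[str],
              reward_fn: Callable[[str, str], float]) -> str:
    """Return the candidate prompt with the highest predicted reward for this query."""
    return max(prompts, key=lambda p: reward_fn(query, p))


# Invented toy data, for illustration only.
PROMPTS: List[str] = ["Let's think step by step.", "Answer directly."]
LONG_Q = ("Tom has 3 apples, buys 4 more, gives 2 to Ann, "
          "then doubles what is left. How many apples?")
DEMOS: List[Demo] = [
    ("What is 2 + 3?", PROMPTS[0], False),
    ("What is 2 + 3?", PROMPTS[1], True),
    (LONG_Q, PROMPTS[0], True),
    (LONG_Q, PROMPTS[1], False),
]

reward = fit_offline_reward(DEMOS)
short_choice = best_of_n("What is 7 * 6?", PROMPTS, reward)
long_choice = best_of_n(
    "A train travels 60 miles per hour for 3 hours and then 40 miles per hour "
    "for 2 hours. How far does it go?", PROMPTS, reward)
print(short_choice)
print(long_choice)
```

The point of the sketch is the query dependency: the same candidate set yields different recommendations for different queries, and the selection step only consults the offline reward function.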