In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference, when the gold answer is unavailable. Concurrently, learning through interactions with the LLM to navigate the expansive natural-language prompting space proves resource-intensive. To address these challenges, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as a by-product of benchmarking diverse prompts on openly accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model that can evaluate any query-prompt pair without accessing the LLM. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.
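To make the two-stage recipe concrete, the sketch below illustrates one possible instantiation in Python. It assumes an offline demonstration set of `(query, prompt, solved)` triples collected from prior benchmarking, and a text-embedding function `embed`; all names are hypothetical, and a logistic-regression classifier stands in for the learned offline reward model rather than reproducing the paper's actual implementation.

```python
# Illustrative sketch only: a simple instantiation of the two-stage recipe
# described in the abstract. `embed` and the demonstration format are
# assumptions, not the paper's code.
from typing import Callable, List, Tuple
import numpy as np
from sklearn.linear_model import LogisticRegression


def featurize(query: str, prompt: str,
              embed: Callable[[str], np.ndarray]) -> np.ndarray:
    # Joint representation of a query-prompt pair.
    return np.concatenate([embed(query), embed(prompt)])


def fit_offline_reward_model(
        demos: List[Tuple[str, str, int]],  # (query, prompt, 1 if solved)
        embed: Callable[[str], np.ndarray]) -> LogisticRegression:
    # Stage 1: learn a proxy reward model from offline demonstration data,
    # the by-product of benchmarking prompts on labeled datasets.
    X = np.stack([featurize(q, p, embed) for q, p, _ in demos])
    y = np.array([label for _, _, label in demos])
    return LogisticRegression(max_iter=1000).fit(X, y)


def best_of_n(query: str, candidates: List[str],
              reward_model: LogisticRegression,
              embed: Callable[[str], np.ndarray]) -> str:
    # Stage 2: score every candidate prompt for this specific query and
    # recommend the highest-scoring one. No LLM call is needed here; the
    # reward model works directly on embedded features.
    scores = [reward_model.predict_proba(
                  featurize(query, p, embed).reshape(1, -1))[0, 1]
              for p in candidates]
    return candidates[int(np.argmax(scores))]
```

In this sketch the LLM is queried only once per input, with the recommended prompt; prompt evaluation itself is offline, which is the source of the economic advantage the abstract claims.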