Recent Large Language Models (LLMs) such as ChatGPT have achieved remarkable performance by leveraging human expertise. Yet fully eliciting their potential on complex tasks requires navigating the vast search space of natural language prompts. While prompt engineering has shown promise, it relies on human-crafted prompts refined through trial and error, and the associated costs pose significant challenges. Crucially, the efficiency of prompt optimization hinges on the costly procedure of prompt evaluation. This work introduces Prompt-OIRL, an approach rooted in offline inverse reinforcement learning (Inverse-RL) that seeks to bridge the gap between effective prompt evaluation and affordability. Our method draws on offline datasets from expert evaluations, employing Inverse-RL to derive a reward model for offline, query-dependent prompt evaluation. The advantages of Prompt-OIRL are manifold: it predicts prompt performance, is cost-efficient, produces human-readable results, and efficiently navigates the prompt space. We validate our method across four LLMs and three arithmetic datasets, highlighting its potential as a robust and effective tool for offline prompt evaluation and optimization. Our code and the offline datasets are released, and we highlight that Prompt-OIRL can be reproduced within a few hours on a single laptop using a CPU.
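The core idea of learning a reward model from logged prompt evaluations and then using it to choose a prompt per query, without further LLM calls, can be illustrated with a toy sketch. This is not the released Prompt-OIRL implementation: the synthetic data, the one-dimensional query feature, the logistic reward model, and all function names below are assumptions made purely for illustration.

```python
import math
import random

def featurize(query_feat, prompt_id, n_prompts):
    """Hypothetical featurizer: a per-prompt bias plus a per-prompt query weight.
    A real system would use query and prompt embeddings instead."""
    x = [0.0] * (2 * n_prompts)
    x[prompt_id] = 1.0
    x[n_prompts + prompt_id] = query_feat
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(records, n_prompts, epochs=200, lr=0.5):
    """Fit a logistic reward model on logged (query, prompt, correct) triples
    with full-batch gradient descent. No LLM is queried during training."""
    w = [0.0] * (2 * n_prompts)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for q, p, y in records:
            x = featurize(q, p, n_prompts)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * gi / len(records) for wi, gi in zip(w, grad)]
    return w

def predict_reward(w, query_feat, prompt_id, n_prompts):
    """Predicted probability that this prompt yields a correct answer for this query."""
    x = featurize(query_feat, prompt_id, n_prompts)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def best_prompt(w, query_feat, n_prompts):
    """Query-dependent selection: pick the candidate prompt with the highest predicted reward."""
    return max(range(n_prompts), key=lambda p: predict_reward(w, query_feat, p, n_prompts))

# Synthetic offline log (assumption): prompt 0 succeeds on queries with a
# negative feature, prompt 1 on queries with a positive feature.
N_PROMPTS = 2
rng = random.Random(0)
records = []
for _ in range(200):
    q = rng.uniform(-1.0, 1.0)
    p = rng.randrange(N_PROMPTS)
    y = 1 if (q < 0) == (p == 0) else 0
    records.append((q, p, y))

w = train_reward_model(records, N_PROMPTS)
choice_neg = best_prompt(w, -0.8, N_PROMPTS)
choice_pos = best_prompt(w, 0.8, N_PROMPTS)
```

In this sketch the learned model scores every candidate prompt for a new query offline, so the selection step costs only a few arithmetic operations rather than additional LLM evaluations.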