The effectiveness of Large Language Models (LLMs) in solving tasks depends heavily on the quality of their instructions, which often demand substantial human effort to refine. This highlights the need for automated instruction optimization; however, such optimization is particularly challenging for black-box LLMs, whose parameters and gradients are inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Moreover, ACING not only recovers but also surpasses human-crafted expert instructions, improving on human benchmarks by up to 39 percentage points.
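To make the continuum-bandit framing concrete, below is a minimal Python sketch of a stateless actor-critic loop over a continuous action space. Everything here is an illustrative assumption rather than ACING's actual implementation: the action vector stands in for a continuous prompt representation, and `black_box_reward` is a toy stand-in for the non-differentiable task score that an LLM evaluation would return.

```python
import torch
import torch.nn as nn

DIM = 16  # dimension of the continuous action (e.g., a soft-prompt vector)

def black_box_reward(action: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for a non-differentiable task score. In the paper's
    setting this would be the evaluation score of the instruction induced
    by the action; here, a quadratic bowl so the sketch runs end to end."""
    target = torch.full_like(action, 0.3)
    return -((action - target) ** 2).sum(dim=-1, keepdim=True)

# Stateless setting: the actor is just a learnable mean action with
# Gaussian exploration noise; there is no state input.
actor_mean = nn.Parameter(torch.zeros(DIM))
critic = nn.Sequential(nn.Linear(DIM, 64), nn.Tanh(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam([actor_mean], lr=1e-2)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

for step in range(500):
    # 1. Explore: sample a continuous action around the current mean.
    action = (actor_mean + 0.1 * torch.randn(DIM)).detach()
    reward = black_box_reward(action)  # observed, non-differentiable

    # 2. Critic update: regress the predicted value onto the observed reward.
    critic_loss = (critic(action) - reward).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 3. Actor update: ascend the critic's value surface (deterministic-
    #    policy style), sidestepping the non-differentiable reward.
    actor_loss = -critic(actor_mean).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

print("learned action (first 4 dims):", actor_mean.data[:4].tolist())
```

The point the sketch illustrates is that the critic serves as a differentiable surrogate for the black-box reward: the actor is updated through the critic, never through the reward signal itself.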