When naive prompts are carefully optimized by human experts, the task performance of large language models (LLMs) can be significantly improved. However, expert-driven prompt optimization is expensive. To reduce this cost, prior works have proposed Automatic Prompt Optimization (APO), which optimizes naive prompts according to the task outputs of given in-box testing models, with the help of advanced LLMs (e.g., GPT-4) in an ad-hoc way. Although effective, existing schemes suffer from poor generalization ability and privacy risks. To address these issues, we collect the first large-scale Prompt Optimization Preference dataset (POP), fine-tune offline local LLMs as prompt optimizers, and then fairly test them with various downstream models. Our method accurately optimizes the core task-instruction part of a naive prompt in a model-agnostic manner, and is therefore named Free-form Instruction-oriented Prompt Optimization (FIPO). Specifically, FIPO uses a modular APO template that dynamically integrates the naive task instruction, optional instruction responses, and optional ground truth to produce finely optimized prompts. The POP dataset is meticulously constructed with advanced LLMs and undergoes rigorous cross-validation by human experts and analytical models. Leveraging insights from this data, we fine-tune Tulu2 models with diverse strategies and validate the efficacy of the FIPO framework across five public benchmarks and six testing models. Code and data are available at: https://github.com/LuJunru/FIPO_Project.
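To make the modular template concrete, below is a minimal Python sketch of how the three inputs might be assembled into a single meta-prompt for the optimizer. The function name, section headers, and rewrite instruction are our own illustration under assumed conventions, not the exact template released in the repository:

```python
from typing import Optional

def build_apo_template(naive_instruction: str,
                       naive_response: Optional[str] = None,
                       ground_truth: Optional[str] = None) -> str:
    """Assemble a meta-prompt for the optimizer from whichever parts are available.

    A hypothetical sketch of the modular APO template: the naive task
    instruction is mandatory, while the instruction response and ground
    truth are optional and included only when provided.
    """
    parts = [f"# Naive task instruction:\n{naive_instruction}"]
    if naive_response is not None:
        # Optional: a testing model's response to the naive instruction.
        parts.append(f"# Naive response (optional):\n{naive_response}")
    if ground_truth is not None:
        # Optional: the expected answer, used to ground the optimization.
        parts.append(f"# Ground truth (optional):\n{ground_truth}")
    parts.append("# Rewrite the naive instruction above into an optimized prompt:")
    return "\n\n".join(parts)

# Example usage: only the instruction and ground truth are available.
print(build_apo_template("Classify the sentiment of this review.",
                         ground_truth="positive"))
```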