When naive prompts are carefully optimized by human experts, the task performance of large language models (LLMs) can improve significantly. However, expert-based prompt optimization is expensive. Consequently, several works have proposed Automatic Prompt Optimization (APO), which refines naive prompts according to the task outputs of given in-box testing models, with the help of advanced LLMs (e.g., GPT-4) in an ad-hoc way. Although effective, existing schemes suffer from poor generalization ability and privacy risks. To this end, we collect the first large-scale Prompt Optimization Preference dataset (POP), fine-tune offline local LLM-based optimizers, and then fairly test them with various downstream models. Our method accurately optimizes the core task-instruction part of a naive prompt in a model-agnostic manner, and is thus named Free-form Instruction-oriented Prompt Optimization (FIPO). Specifically, FIPO uses a modular APO template that dynamically integrates the naive task instruction, optional instruction responses, and optional ground truth to produce finely optimized prompts. The POP dataset is meticulously constructed with advanced LLMs and undergoes rigorous cross-validation by human experts and analytical models. Leveraging insights from this data with Tulu2 models and diverse fine-tuning strategies, we validate the efficacy of the FIPO framework across five public benchmarks and six testing models. Code and data are available at: https://github.com/LuJunru/FIPO_Project.
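The modular APO template can be sketched as follows. This is a minimal illustration of how a meta-prompt might dynamically integrate the required task instruction with optional response and ground-truth fields; the section headers and wording are assumptions for illustration, not the released FIPO template.

```python
# Hypothetical sketch of a modular APO meta-prompt builder.
# Field names and prompt wording are illustrative assumptions,
# not the actual template released with FIPO.
from typing import Optional


def build_meta_prompt(naive_instruction: str,
                      response: Optional[str] = None,
                      ground_truth: Optional[str] = None) -> str:
    """Assemble the optimizer input from the required naive instruction
    plus optional response and ground-truth sections."""
    parts = [f"### Naive instruction:\n{naive_instruction}"]
    if response is not None:  # include the testing model's output when available
        parts.append(f"### Model response:\n{response}")
    if ground_truth is not None:  # include the reference answer when available
        parts.append(f"### Ground truth:\n{ground_truth}")
    parts.append("### Optimized instruction:")
    return "\n\n".join(parts)
```

The optional sections are appended only when supplied, so the same template serves all combinations of available supervision signals; the resulting string would then be fed to the fine-tuned local optimizer.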