Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse because each evaluation of the target LM is expensive. Moreover, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as posterior inference over latent prompts, regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search on high-reward regions. Across few-shot text classification, instruction induction, and question answering benchmarks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.
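To make the posterior-inference view concrete, the following is a minimal sketch of the kind of objective the abstract suggests, assuming the standard trajectory-balance formulation used in GFlowNet fine-tuning of LMs; the reward temperature $\beta$ and learned partition function $Z_\theta$ are our notational assumptions, not symbols taken from the paper.

```latex
% Target posterior over latent prompts z: reference-LM prior times
% exponentiated reward (beta is an assumed temperature hyperparameter).
\[
p^{*}(z) \;\propto\; p_{\mathrm{ref}}(z \mid \text{meta-prompt})\,
\exp\!\big(R(z)/\beta\big)
\]
% A trajectory-balance loss that the prompt-LM q_theta could minimize
% off-policy, on prompts z re-drawn from a replay buffer:
\[
\mathcal{L}_{\mathrm{TB}}(z;\theta) \;=\;
\Big( \log Z_\theta + \log q_\theta(z)
- \log p_{\mathrm{ref}}(z \mid \text{meta-prompt})
- R(z)/\beta \Big)^{2}
\]
```

Because this loss is valid for any prompt $z$, not only those sampled from the current policy, it is compatible with the replay-based, off-policy training the abstract describes.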
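Below is a minimal, hypothetical sketch of the Dynamic Memory Update step described above, assuming a plain list as the replay buffer and a fixed-size min-heap as the priority queue keyed on reward; all names (`PromptEntry`, `dmu_update`, `k_diverse`, `k_top`) are illustrative, not from the paper.

```python
import heapq
import random
from dataclasses import dataclass, field

@dataclass(order=True)
class PromptEntry:
    reward: float
    prompt: str = field(compare=False)  # compare entries by reward only

def dmu_update(meta_prompt_header: str,
               replay_buffer: list[PromptEntry],
               priority_queue: list[PromptEntry],
               k_diverse: int = 4,
               k_top: int = 2) -> str:
    """Training-free meta-prompt refresh: inject (i) diverse replayed
    prompts and (ii) top-reward prompts, per the DMU description."""
    # (i) diversity: uniform sample from the replay buffer.
    diverse = random.sample(replay_buffer, min(k_diverse, len(replay_buffer)))
    # (ii) exploitation: highest-reward entries from the priority queue
    # (a min-heap, so nlargest retrieves the best performers).
    top = heapq.nlargest(k_top, priority_queue)
    exemplars = "\n".join(f"- {e.prompt} (reward={e.reward:.2f})"
                          for e in diverse + top)
    return (f"{meta_prompt_header}\n"
            f"Examples of previously tried prompts:\n{exemplars}")

def record(priority_queue: list[PromptEntry],
           entry: PromptEntry, max_size: int = 8) -> None:
    """Keep only the max_size best prompts seen so far."""
    heapq.heappush(priority_queue, entry)
    if len(priority_queue) > max_size:
        heapq.heappop(priority_queue)  # evict the current worst entry
```

Keeping the priority queue as a small bounded min-heap matches the abstract's "small priority queue": evictions cost O(log n), and mixing its contents with uniformly sampled replay entries is one simple way to balance exploitation against diversity in the refreshed meta-prompt.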