Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling, which prioritizes informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation, which expands the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality-cost trade-offs under constrained comparison budgets.
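To make the dueling-bandit framing concrete, the following is a minimal sketch of Thompson-sampling-based duel selection over a fixed pool of prompts. It is a simplified illustration and not the authors' implementation: it omits the confidence-bound pruning of full Double Thompson Sampling as well as the mutation and pool-pruning steps, and it replaces the LLM judge with a simulated judge. All names (`select_duel`, `run_duels`, `copeland_winner`, `make_simulated_judge`) and the hidden "strength" values are assumptions introduced for this example.

```python
import random


def select_duel(wins, rng):
    """Pick the next pair of candidates to compare (simplified D-TS-style rule)."""
    n = len(wins)
    # Sample a plausible preference matrix from Beta posteriors over pairwise wins.
    theta = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # First candidate: highest sampled Copeland score (opponents beaten), ties broken at random.
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i) for i in range(n)]
    top = max(copeland)
    first = rng.choice([i for i in range(n) if copeland[i] == top])
    # Second candidate: resample each opponent against `first`, take the strongest challenger.
    challenge = {
        j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1)
        for j in range(n)
        if j != first
    }
    second = max(challenge, key=challenge.get)
    return first, second


def run_duels(n, judge, budget, rng):
    """Spend `budget` judge calls; wins[i][j] counts how often i beat j."""
    wins = [[0] * n for _ in range(n)]
    for _ in range(budget):
        a, b = select_duel(wins, rng)
        if judge(a, b):
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


def copeland_winner(wins):
    """Return the candidate that empirically beats the most opponents."""
    n = len(wins)
    scores = [sum(wins[i][j] > wins[j][i] for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=scores.__getitem__)


def make_simulated_judge(strengths, rng):
    """Hypothetical stand-in for an LLM judge: a noisy Bradley-Terry preference."""
    def judge(a, b):
        return rng.random() < strengths[a] / (strengths[a] + strengths[b])
    return judge


if __name__ == "__main__":
    rng = random.Random(0)
    judge = make_simulated_judge([3.0, 1.0, 0.5, 0.2], rng)
    wins = run_duels(4, judge, budget=200, rng=rng)
    print("selected prompt index:", copeland_winner(wins))
```

In a real setting, `judge(a, b)` would run prompts `a` and `b` on a sampled input and ask an LLM which output it prefers. Because pairs are chosen by sampling from the posterior, comparisons tend to concentrate on prompts that are still plausible winners, which is the intuition behind spending a fixed judge budget efficiently.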