Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation to expand the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality--cost trade-offs under constrained comparison budgets.
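The dueling-bandit selection step can be illustrated with a minimal sketch of Double Thompson Sampling over pairwise judge preferences. This is an assumption-laden illustration, not PDO's actual implementation: it maintains Beta posteriors over each pair's win probability, samples a Copeland-style first arm, then resamples to pick its strongest challenger.

```python
import numpy as np

def double_thompson_sampling_duel(wins, losses, rng):
    """Pick a pair of prompt indices to compare next (hypothetical sketch).

    wins[i, j] / losses[i, j] count judge preferences for / against
    prompt i when dueled with prompt j.
    """
    n = wins.shape[0]
    # Sample pairwise win probabilities from Beta(wins+1, losses+1) posteriors.
    theta = rng.beta(wins + 1, losses + 1)
    np.fill_diagonal(theta, 0.5)  # a prompt never beats itself
    # First arm: sampled Copeland winner (beats the most opponents).
    copeland = (theta > 0.5).sum(axis=1)
    first = int(np.argmax(copeland))
    # Second arm: resample, then pick the strongest challenger to `first`.
    theta2 = rng.beta(wins + 1, losses + 1)
    challengers = theta2[:, first].copy()
    challengers[first] = -np.inf  # never duel a prompt against itself
    second = int(np.argmax(challengers))
    return first, second

rng = np.random.default_rng(0)
wins = np.zeros((4, 4))  # e.g., 4 candidate prompts, no duels yet
i, j = double_thompson_sampling_duel(wins, wins.copy(), rng)
```

Each returned pair would then be sent to the LLM judge, and the outcome folded back into `wins`/`losses`; with uninformative counts the selection is near-uniform, and it concentrates on informative duels as evidence accumulates.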