Bandits with preference feedback offer a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed, rather than direct value queries. This model enables incorporating human feedback into online inference and optimization, and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over small finite domains, which limits its practical relevance. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid, rate-optimal regret guarantee. This is enabled by our novel preference-based confidence sequences for kernelized logistic estimators.
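To make the game-theoretic selection rule concrete, the following is a minimal sketch of a max-min pair selection in the spirit of the abstract, restricted to a finite candidate grid for illustration. The `lcb(x, y)` oracle, the toy utility, and the helper name `maxmin_lcb_pair` are illustrative assumptions, not the paper's implementation; in MAXMINLCB the lower confidence bound would come from the kernelized logistic confidence sequences described above, and the exact rule may differ.

```python
import numpy as np

def maxmin_lcb_pair(actions, lcb):
    """Sketch of a zero-sum Stackelberg pair selection.

    The leader (first action) maximizes a pessimistic estimate of winning
    the duel, anticipating that the follower (second action) is chosen
    adversarially to minimize it. `lcb(x, y)` is an assumed lower
    confidence bound on the preference of x over y.
    """
    # Pessimistic preference value for every candidate pair (x, y).
    values = np.array([[lcb(x, y) for y in actions] for x in actions])
    # Follower's best response: for each x, the hardest competitor y.
    best_responses = values.argmin(axis=1)
    # Leader maximizes the worst-case (best-response) value.
    leader = values[np.arange(len(actions)), best_responses].argmax()
    follower = best_responses[leader]
    return actions[leader], actions[follower]

# Toy usage: a hypothetical latent utility and a stand-in LCB formed by
# shifting a Bradley-Terry-style preference probability downward.
grid = np.linspace(0.0, 1.0, 50)
f = lambda x: np.sin(3 * x)
lcb = lambda x, y: 1.0 / (1.0 + np.exp(-(f(x) - f(y)))) - 0.1
x_t, y_t = maxmin_lcb_pair(grid, lcb)
```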