On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison

User preference learning is generally a hard problem: individual preferences are often unknown even to the users themselves, and the space of choices is infinite. Here we study user preference learning from an information-theoretic perspective. We model preference learning as a system of two interacting subsystems, one representing a user with their preferences and the other representing an agent that has to learn these preferences. The user's behaviour is modelled by a parametric preference function. To learn the preferences efficiently and shrink the search space quickly, we propose an agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them according to their preference function. We show that the optimal agent strategy for data collection and preference learning results from a maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between the true and the agent-assigned predictive distributions of user responses. The resulting value of the KL divergence, which we also call the remaining system uncertainty (RSU), provides an efficient performance metric in the absence of ground truth. This metric characterises how well the agent can predict the user and, thus, the quality of the learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), the agent uses approximate Bayesian inference. The data collection strategy is to generate the proposals whose responses do the most to resolve the uncertainty in predicting user responses. The efficiency of our approach is validated by numerical simulations, and a real-life example of a preference learning application is also provided.
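The interaction loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: a one-dimensional proposal space, a logistic (Bradley-Terry style) response model with utility u(x) = -(x - theta)^2, a grid posterior in place of approximate Bayesian inference, and the standard expected-information-gain rule for choosing the next pair in place of the maximin normalized weighted KL criterion. All function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_prefer_a(a, b, theta, sharpness=10.0):
    # Assumed user model: utility u(x) = -(x - theta)^2, logistic choice rule.
    return sigmoid(sharpness * ((b - theta) ** 2 - (a - theta) ** 2))


def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between two Bernoulli response distributions (in nats).
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def predictive(a, b, grid, posterior):
    # Agent-assigned probability that the user prefers a over b.
    return sum(w * p_prefer_a(a, b, t) for t, w in zip(grid, posterior))


def expected_info_gain(a, b, grid, posterior):
    # Mutual information between the user's response and theta.
    h_pred = binary_entropy(predictive(a, b, grid, posterior))
    h_cond = sum(w * binary_entropy(p_prefer_a(a, b, t))
                 for t, w in zip(grid, posterior))
    return h_pred - h_cond


def bayes_update(a, b, prefers_a, grid, posterior):
    # Exact Bayes rule on the grid; stands in for approximate inference.
    unnorm = []
    for t, w in zip(grid, posterior):
        p = p_prefer_a(a, b, t)
        unnorm.append(w * (p if prefers_a else 1.0 - p))
    total = sum(unnorm)
    return [u / total for u in unnorm]


def learn(true_theta, rounds=10, seed=0):
    rng = random.Random(seed)
    grid = [i / 20 for i in range(21)]            # candidate values of theta
    posterior = [1.0 / len(grid)] * len(grid)     # uniform prior
    proposals = [i / 10 for i in range(11)]
    pairs = [(a, b) for i, a in enumerate(proposals) for b in proposals[i + 1:]]
    for _ in range(rounds):
        # Query the pair whose response is expected to be most informative.
        a, b = max(pairs, key=lambda ab: expected_info_gain(ab[0], ab[1], grid, posterior))
        # Simulated user answers according to the assumed response model.
        prefers_a = rng.random() < p_prefer_a(a, b, true_theta)
        posterior = bayes_update(a, b, prefers_a, grid, posterior)
    return grid, posterior
```

In a simulation where the true parameter is known, `bernoulli_kl(p_prefer_a(a, b, true_theta), predictive(a, b, grid, posterior))` gives a per-pair divergence between the true and the agent-assigned response distributions. This is only loosely analogous to the RSU described above, which is defined through a normalized weighted KL divergence.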
