Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is not model capacity but the ability to exploit the factored structure of preference data.
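To make the offline/online decomposition concrete, here is a minimal sketch of the online inference loop under stated assumptions: binary preference dimensions, a weighted library of complete training profiles standing in for the offline world model, a symmetric answer-noise rate EPS, and greedy expected-information-gain question selection. All names (profiles, marginals, expected_info_gain) and the belief representation are illustrative assumptions, not Pep's actual implementation.

```python
import numpy as np

# --- Offline stage (assumed form): correlations among D binary preference
# dimensions are carried implicitly by a library of N complete profiles.
rng = np.random.default_rng(0)
D, N = 12, 500
latent = rng.random((N, 1))                       # hidden user type -> correlated dims
profiles = (rng.random((N, D)) < 0.3 + 0.5 * latent).astype(int)

EPS = 0.05  # assumed noise rate: P(stated answer differs from true preference)

def marginals(weights):
    """Belief over library profiles -> per-dimension P(dim = 1)."""
    w = weights / weights.sum()
    return w @ profiles

def entropy(p):
    """Elementwise binary entropy of marginal probabilities."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def expected_info_gain(weights, q):
    """Expected drop in total marginal entropy from asking about dimension q."""
    h_now = entropy(marginals(weights)).sum()
    gain = 0.0
    for ans in (0, 1):
        like = np.where(profiles[:, q] == ans, 1 - EPS, EPS)
        w2 = weights * like
        p_ans = w2.sum() / weights.sum()              # P(answer = ans | belief)
        gain += p_ans * (h_now - entropy(marginals(w2)).sum())
    return gain

# --- Online stage: training-free Bayesian elicitation for one simulated user.
true_profile = profiles[rng.integers(N)]
weights = np.ones(N)                                  # uniform prior over the library
asked = []
for _ in range(3):                                    # small question budget
    q = max((d for d in range(D) if d not in asked),
            key=lambda d: expected_info_gain(weights, d))
    ans = true_profile[q]                             # simulate the user's answer
    weights = weights * np.where(profiles[:, q] == ans, 1 - EPS, EPS)
    asked.append(q)

# Predict the full profile, including dimensions never asked about.
pred = (marginals(weights) > 0.5).astype(int)
print("asked dims:", asked, "| full-profile accuracy:", (pred == true_profile).mean())
```

Keeping the belief as weights over stored profiles makes each Bayesian update exact and training-free, and thresholding the final marginals yields a prediction for every dimension, including those never asked about, which is the behavior the abstract describes.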