Personalisation is a standard feature of conversational AI systems used by millions, yet academic research often evaluates the efficacy of personalisation methods with simulated users rather than real people. This raises two questions: how real users and their simulated counterparts differ in interaction patterns and judgements, and whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find that preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting, but that adapting to individual preference data yields only marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but that may have deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies, but the simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from those of humans.