Despite growing interest, most evaluations of the personalization abilities of large language models (LLMs) have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance between synthetic and human data. We collect 550 human conversations, along with human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919 judgments), and incorporating relevant attributes into a personalized response (1,101 judgments). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments about which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses, even though LLM judges widely rate them as better. We introduce two lightweight training-based interventions that bring automated personalization evaluation closer to human data in the first two stages. In the third stage, however, we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned judgments of personalization quality are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.