Large language models (LLMs) have shown impressive results on more objective tasks such as QA and retrieval, but evaluating their performance on open-ended text generation remains nontrivial for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectivity stemming from reviewers' personal preferences. To address these issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets, Per-MPST and Per-DOC, for personalized story evaluation by repurposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model, PERSE, to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, PERSE predicts either a detailed review or a fine-grained comparison along several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that PERSE outperforms GPT-4 by 15.8% on Kendall correlation of story ratings and by 13.7% on pairwise preference prediction accuracy. Both datasets and code will be released.