While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address such issues, we propose to model personalization in an uncontaminated open-ended generation assessment. We create two new datasets Per-MPST and Per-DOC for personalized story evaluation, by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model PERSE to infer reviewer preferences and provide a personalized evaluation. Specifically, given a few exemplary reviews from a particular reviewer, PERSE predicts either a detailed review or fine-grained comparison in several aspects (such as interestingness and surprise) for that reviewer on a new text input. Experimental results show that PERSE outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy. Both datasets and code will be released at https://github.com/dqwang122/PerSE.
翻译:尽管大型语言模型(LLMs)在问答和检索等较客观任务中展现出令人瞩目的成果,但评估其在开放式文本生成上的表现仍非易事,原因包括:(1)数据污染;(2)多维度的评估标准;(3)评审者个人偏好引发的主观性。为解决这些问题,我们提出在无污染的开放式生成评估中建模个性化。通过重新利用现有数据集,进行适当的匿名化处理并添加新的个性化标签,我们创建了两个用于个性化故事评估的新数据集:Per-MPST和Per-DOC。我们进一步开发了个性化故事评估模型PERSE,以推断评审者的偏好并提供个性化评估。具体而言,给定某位评审者的一些示例评论,PERSE能够预测该评审者对新的文本输入在趣味性、惊喜等多个方面的详细评论或细粒度比较。实验结果表明,PERSE在故事评分Kendall相关系数上比GPT-4高出15.8%,在成对偏好预测准确率上高出13.7%。两个数据集和代码将发布在 https://github.com/dqwang122/PerSE。