Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation methods rely on offline metrics for quick algorithm iteration, since online experiments are usually risky and time-consuming. However, offline evaluation often cannot fully reflect users' preferences for the outputs of different recommendation algorithms, and its results may be inconsistent with online A/B tests. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems across different aspects, which may lead to substantial performance differences in long-term online serving. Fortunately, thanks to the strong commonsense knowledge and role-play capabilities of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by the idea of LLM Chatbot Arena, in this paper we present RecSys Arena, where the recommendation results given by two different recommender systems in each session are evaluated by an LLM judge to obtain fine-grained evaluation feedback. More specifically, for each sample we use an LLM to generate a user profile description from the user's behavior history or off-the-shelf profile features; this description then guides the LLM to role-play the user and express a relative preference between the two recommendation results produced by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insights into many subjective aspects. Moreover, RecSys Arena can better distinguish algorithms whose performance is comparable in terms of AUC and nDCG.
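The pipeline described above amounts to two LLM calls per sample: one to distill a profile description from the user's behavior history, and one to role-play that user and compare the two candidate recommendation lists. The following Python sketch is purely illustrative of this two-step structure; the call_llm helper, function names, and prompt wording are assumptions for exposition and do not come from the paper itself.

```python
# A minimal sketch of the two-step judging pipeline, assuming a generic
# call_llm(prompt) -> str helper for whatever LLM backend is used.
# All prompt wording here is hypothetical, not taken from the paper.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (plug in any chat-completion client)."""
    raise NotImplementedError("connect this to your LLM backend")

def build_user_profile(behavior_history: list[str]) -> str:
    """Step 1: ask the LLM to summarize a user profile from behavior history."""
    prompt = (
        "Summarize the interests and preferences of a user who interacted "
        "with the following items, as a short profile description:\n"
        + "\n".join(f"- {item}" for item in behavior_history)
    )
    return call_llm(prompt)

def judge_pair(profile: str, rec_a: list[str], rec_b: list[str]) -> str:
    """Step 2: role-play the user and compare two recommendation lists."""
    prompt = (
        f"You are the following user:\n{profile}\n\n"
        "Two recommender systems produced these lists for you.\n"
        f"List A: {rec_a}\nList B: {rec_b}\n"
        "As this user, state which list you prefer (A, B, or Tie) and "
        "briefly explain your preference on aspects such as relevance, "
        "diversity, and novelty."
    )
    return call_llm(prompt)
```

In this sketch the judge returns free-form text; in practice the verdict (A/B/Tie) and the aspect-level commentary would be parsed out so that pairwise wins can be aggregated across sessions.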