Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation methods rely on offline metrics for quick algorithm iteration, since online experiments are usually risky and time-consuming. However, offline evaluation often cannot fully reflect users' preferences for the outputs of different recommendation algorithms, and its results may be inconsistent with online A/B tests. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems across different aspects, which may lead to substantial performance differences in long-term online serving. Fortunately, thanks to the strong commonsense knowledge and role-play capabilities of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by the idea of LLM Chatbot Arena, in this paper we present RecSys Arena, where the recommendation results given by two different recommender systems in each session are evaluated by an LLM judge to obtain fine-grained evaluation feedback. More specifically, for each sample we use an LLM to generate a user profile description from the user's behavior history or off-the-shelf profile features; this description then guides the LLM to role-play the user and express a relative preference between the two recommendation results produced by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insights into many subjective aspects. Moreover, RecSys Arena can better distinguish algorithms whose performance is comparable in terms of AUC and nDCG.
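The pipeline described above amounts to two LLM calls per sample: one to distill a profile description from the user's behavior history, and one to role-play that user and compare the two candidate recommendation lists. The following Python sketch is purely illustrative of this two-step structure; the call_llm helper, function names, and prompt wording are assumptions for exposition and do not come from the paper itself.

```python
# A minimal sketch of the two-step judging pipeline, assuming a generic
# call_llm(prompt) -> str helper for whatever LLM backend is used.
# All prompt wording here is hypothetical, not taken from the paper.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (plug in any chat-completion client)."""
    raise NotImplementedError("connect this to your LLM backend")

def build_user_profile(behavior_history: list[str]) -> str:
    """Step 1: ask the LLM to summarize a user profile from behavior history."""
    prompt = (
        "Summarize the interests and preferences of a user who interacted "
        "with the following items, as a short profile description:\n"
        + "\n".join(f"- {item}" for item in behavior_history)
    )
    return call_llm(prompt)

def judge_pair(profile: str, rec_a: list[str], rec_b: list[str]) -> str:
    """Step 2: role-play the user and compare two recommendation lists."""
    prompt = (
        f"You are the following user:\n{profile}\n\n"
        "Two recommender systems produced these lists for you.\n"
        f"List A: {rec_a}\nList B: {rec_b}\n"
        "As this user, state which list you prefer (A, B, or Tie) and "
        "briefly explain your preference on aspects such as relevance, "
        "diversity, and novelty."
    )
    return call_llm(prompt)
```

In this sketch the judge returns free-form text; in practice the verdict (A/B/Tie) and the aspect-level commentary would be parsed out so that pairwise wins can be aggregated across sessions.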