With the rapid development of Large Language Models (LLMs), recent studies have employed LLMs as recommenders to provide personalized information services for distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention has been paid to beyond-utility dimensions. Moreover, LLM-based recommendation models have unique evaluation aspects that have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new dimensions are: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. All four can affect performance, yet none needs to be considered in traditional systems. Using this multidimensional evaluation framework, together with traditional aspects, we evaluate seven LLM-based recommenders under three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks across four datasets. We find that in the ranking setting, LLMs excel at tasks involving prior knowledge and shorter input histories, and that they perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias, and some models hallucinate non-existent items far more often than others. We hope our evaluation framework and observations will benefit future research on the use of LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.
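Two of the proposed LLM-specific dimensions, hallucination and candidate position bias, lend themselves to simple quantitative checks. The sketch below is a minimal illustration, not the paper's implementation: `hallucination_rate` counts recommended items absent from the item catalog, and `position_bias` probes whether a ranker's hit rate depends on where the ground-truth item is placed among the candidates. The function names and the `rank_fn` interface are hypothetical.

```python
def hallucination_rate(recommended, catalog):
    """Fraction of recommended items that do not exist in the catalog
    (i.e., items the LLM hallucinated)."""
    catalog = set(catalog)
    if not recommended:
        return 0.0
    return sum(1 for item in recommended if item not in catalog) / len(recommended)


def position_bias(rank_fn, target, distractors):
    """Insert the ground-truth item at each candidate position in turn and
    record whether the ranker still places it first. For an unbiased ranker,
    the resulting hit pattern should not depend on the input position."""
    hits = []
    for pos in range(len(distractors) + 1):
        candidates = distractors[:pos] + [target] + distractors[pos:]
        hits.append(rank_fn(candidates)[0] == target)
    return hits


# Example: a degenerate ranker that simply keeps the input order is maximally
# position-biased -- it "hits" only when the target is already in slot 0.
keep_order = lambda candidates: candidates
print(hallucination_rate(["item_a", "item_x"], {"item_a", "item_b"}))  # -> 0.5
print(position_bias(keep_order, "target", ["d1", "d2"]))  # -> [True, False, False]
```

A real evaluation would wrap an LLM prompt inside `rank_fn` and average the hit pattern over many users, but the per-position bookkeeping is the same.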