With the rapid development of Large Language Models (LLMs), recent studies have employed LLMs as recommenders to provide personalized information services for distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention has been paid to beyond-utility dimensions. Moreover, LLM-based recommendation models have unique evaluation aspects that have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new dimensions are: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. All four can affect performance, yet none needs to be considered in traditional systems. Using this multidimensional evaluation framework, together with traditional aspects, we evaluate seven LLM-based recommenders under three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks across four datasets. We find that in the ranking setting, LLMs excel at tasks involving prior knowledge and shorter input histories, and that they perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias, and some models hallucinate non-existent items far more often than others. We hope our evaluation framework and observations will benefit future research on the use of LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.
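Two of the proposed LLM-specific dimensions, hallucination and candidate position bias, lend themselves to simple quantitative checks. The sketch below is a minimal illustration, not the paper's implementation: `hallucination_rate` counts recommended items absent from the item catalog, and `position_bias` probes whether a ranker's hit rate depends on where the ground-truth item is placed among the candidates. The function names and the `rank_fn` interface are hypothetical.

```python
def hallucination_rate(recommended, catalog):
    """Fraction of recommended items that do not exist in the catalog
    (i.e., items the LLM hallucinated)."""
    catalog = set(catalog)
    if not recommended:
        return 0.0
    return sum(1 for item in recommended if item not in catalog) / len(recommended)


def position_bias(rank_fn, target, distractors):
    """Insert the ground-truth item at each candidate position in turn and
    record whether the ranker still places it first. For an unbiased ranker,
    the resulting hit pattern should not depend on the input position."""
    hits = []
    for pos in range(len(distractors) + 1):
        candidates = distractors[:pos] + [target] + distractors[pos:]
        hits.append(rank_fn(candidates)[0] == target)
    return hits


# Example: a degenerate ranker that simply keeps the input order is maximally
# position-biased -- it "hits" only when the target is already in slot 0.
keep_order = lambda candidates: candidates
print(hallucination_rate(["item_a", "item_x"], {"item_a", "item_b"}))  # -> 0.5
print(position_bias(keep_order, "target", ["d1", "d2"]))  # -> [True, False, False]
```

A real evaluation would wrap an LLM prompt inside `rank_fn` and average the hit pattern over many users, but the per-position bookkeeping is the same.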