In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is still needed to thoroughly evaluate and compare the recommendation capabilities of LLMs against those of traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifiers, text, semantic embeddings, and semantic identifiers) and evaluates two primary recommendation tasks: click-through rate (CTR) prediction and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets spanning the fashion, news, video, book, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We hope our findings will inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.