Recently, the rapid development of Large Language Models (LLMs) such as ChatGPT has significantly advanced NLP tasks by enhancing the capabilities of conversational models. However, the application of LLMs in the recommendation domain has not been thoroughly investigated. To bridge this gap, we propose LLMRec, an LLM-based recommender system designed for benchmarking LLMs on various recommendation tasks. Specifically, we benchmark several popular off-the-shelf LLMs, such as ChatGPT, LLaMA, and ChatGLM, on five recommendation tasks: rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Furthermore, we investigate the effectiveness of supervised finetuning in improving LLMs' instruction compliance ability. The benchmark results indicate that LLMs display only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation. However, they demonstrate performance comparable to state-of-the-art methods in explainability-based tasks. We also conduct qualitative evaluations to further assess the quality of the content generated by different models, and the results show that LLMs can truly understand the provided information and generate clearer and more reasonable results. We hope that this benchmark will inspire researchers to delve deeper into the potential of LLMs for enhancing recommendation performance. Our code, processed data, and benchmark results are available at https://github.com/williamliujl/LLMRec.