Evaluating ChatGPT as a Recommender System: A Rigorous Approach

Large AI language models have recently gained popularity because of their impressive natural language capabilities. They perform strongly on language-related tasks, and prompt-based learning makes them adaptable to a variety of specific tasks, improving their precision and generalization. Research communities are actively exploring their applications, with ChatGPT receiving particular recognition. Despite extensive research on large language models, their potential in recommendation scenarios remains underexplored. This study aims to fill this gap by investigating ChatGPT's capabilities as a zero-shot recommender system. Our goals are to evaluate its ability to use user preferences for recommendations, to reorder existing recommendation lists, to leverage information from similar users, and to handle cold-start situations. We assess ChatGPT's performance through comprehensive experiments on three datasets (MovieLens Small, Last.FM, and Facebook Book). We compare ChatGPT against standard recommendation algorithms and other large language models, such as GPT-3.5 and PaLM-2. To measure recommendation effectiveness, we employ widely used evaluation metrics: Mean Average Precision (MAP), Recall, Precision, F1, normalized Discounted Cumulative Gain (nDCG), Item Coverage, Expected Popularity Complement (EPC), Average Coverage of Long Tail (ACLT), Average Recommendation Popularity (ARP), and Popularity-based Ranking-based Equal Opportunity (PopREO). By thoroughly exploring ChatGPT's abilities in recommender systems, our study aims to contribute to the growing body of research on the versatility and potential applications of large language models. Our experiment code is available in the GitHub repository: https://github.com/sisinflab/Recommender-ChatGPT
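
To make the accuracy metrics named above concrete, the following is a minimal sketch of Precision, Recall, and nDCG at a cutoff k for a single user. It assumes binary relevance (an item is either in the user's held-out set or not) and is not taken from the experiment code in the repository; the function names and the toy data are hypothetical and chosen only for illustration.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance nDCG: DCG of the list divided by the ideal DCG."""
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first (at most k of them).
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy example: one user's ranked list and held-out relevant items.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

print(precision_at_k(recommended, relevant, 4))
print(recall_at_k(recommended, relevant, 4))
print(ndcg_at_k(recommended, relevant, 4))
```

In an evaluation over a full dataset, per-user values like these would typically be averaged across all users; the beyond-accuracy metrics (for example Item Coverage or ARP) additionally require catalog-level popularity statistics, which this sketch omits.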