Large Language Models (LLMs) have recently shown impressive abilities across a wide range of natural language tasks. Among them, recent studies report that ChatGPT performs strongly on many tasks, especially under zero-shot and few-shot prompting. Given these successes, the Recommender Systems (RSs) research community has started investigating its potential applications in recommendation scenarios. However, although various methods have been proposed to integrate ChatGPT's capabilities into RSs, current research struggles to evaluate such models comprehensively while accounting for the peculiarities of generative models. Evaluations often ignore hallucinations, duplicates, and recommendations that fall outside the closed domain (the item catalog), and they focus solely on accuracy metrics, neglecting the impact on beyond-accuracy facets. To bridge this gap, we propose a robust evaluation pipeline that assesses ChatGPT's ability as an RS and post-processes its recommendations to account for these aspects. Through this pipeline, we investigate the performance of ChatGPT-3.5 and ChatGPT-4 on the recommendation task under the zero-shot condition, using a role-playing prompt. We analyze the models in three settings (Top-N recommendation, cold-start recommendation, and re-ranking of a list of recommendations) and in three domains (movies, music, and books). The experiments reveal that ChatGPT achieves higher accuracy than the baselines in the books domain. It also excels in the re-ranking and cold-start scenarios while maintaining reasonable beyond-accuracy metrics. Furthermore, we measure the similarity between ChatGPT's recommendations and those of the other recommenders, providing insight into how ChatGPT could be categorized within the realm of recommender systems. The evaluation pipeline is publicly released for future research.
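As an illustration of the post-processing idea described in the abstract, the following is a minimal sketch, not the released pipeline. It assumes that generated titles are matched against the catalog by normalized exact string comparison, and it uses Jaccard overlap as a stand-in for the similarity measure, which the abstract does not specify. The names `normalize`, `postprocess`, and `jaccard` are hypothetical.

```python
from typing import Iterable, List, Set


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def postprocess(raw: Iterable[str], catalog: Set[str], n: int) -> List[str]:
    """Keep the first n unique recommendations that exist in the catalog.

    Assumption: an item counts as hallucinated or out-of-domain when its
    normalized title is not found in the catalog.
    """
    if n <= 0:
        return []
    known = {normalize(t): t for t in catalog}
    seen: Set[str] = set()
    kept: List[str] = []
    for title in raw:
        key = normalize(title)
        if key not in known:  # hallucinated or outside the closed domain
            continue
        if key in seen:  # duplicate of an earlier recommendation
            continue
        seen.add(key)
        kept.append(known[key])  # return the catalog's canonical spelling
        if len(kept) == n:
            break
    return kept


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Set overlap between two recommendation lists (0.0 if both are empty)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0
```

Accuracy and beyond-accuracy metrics would then be computed on the filtered list rather than on the raw model output, so that invalid items are neither rewarded nor silently counted.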