Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner. However, the extent to which LLMs can comprehend user preferences based on previous behavior remains an open and emerging research question. Traditionally, Collaborative Filtering (CF) has been the most effective method for such tasks, relying predominantly on a large volume of rating data. In contrast, LLMs typically require considerably less data while retaining extensive world knowledge about each item, such as movies or products. In this paper, we conduct a thorough examination of both CF and LLMs on the classic task of user rating prediction, which involves predicting a user's rating for a candidate item based on their past ratings. We investigate LLMs of different sizes, ranging from 250M to 540B parameters, and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We conduct a comprehensive analysis comparing LLMs with strong CF methods, and find that zero-shot LLMs lag behind traditional recommender models that have access to user interaction data, indicating the importance of such data. However, through fine-tuning, LLMs achieve comparable or even better performance with only a small fraction of the training data, demonstrating their potential in terms of data efficiency.
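To make the rating prediction task concrete, the following is a minimal sketch of how a user's past ratings and a candidate item might be turned into a zero-shot prompt, and how predictions might be scored. The prompt wording, the 1-5 rating scale, the function names `build_zero_shot_prompt` and `rmse`, and the choice of root mean squared error as the metric are all illustrative assumptions, not details taken from the paper.

```python
import math


def build_zero_shot_prompt(history, candidate):
    """Format past (title, rating) pairs and a candidate title as a text prompt.

    The wording and the 1-5 scale are hypothetical, chosen for illustration.
    """
    lines = ["Here are a user's past ratings on a 1-5 scale:"]
    for title, rating in history:
        lines.append(f"- {title}: {rating}")
    lines.append(f"Predict the user's rating for: {candidate}")
    lines.append("Answer with a single number.")
    return "\n".join(lines)


def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    history = [("Movie A", 5), ("Movie B", 2), ("Movie C", 4)]
    print(build_zero_shot_prompt(history, "Movie D"))
    print(rmse([3.0, 5.0], [4.0, 4.0]))
```

In the few-shot setting, a prompt of this kind would additionally be preceded by a handful of worked examples; in the fine-tuning setting, the same input text could serve as the model input with the true rating as the target.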