Recently, large language models (LLMs) (e.g., GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work investigates the capacity of LLMs to act as the ranking model in recommender systems. To conduct our empirical study, we first formalize the recommendation problem as a conditional ranking task, treating sequential interaction histories as conditions and the items retrieved by the candidate generation model as candidates. We adopt a specific prompting approach to solving the ranking task with LLMs: we carefully design the prompting template to include the sequential interaction history, the candidate items, and the ranking instruction. We conduct extensive experiments on two widely used datasets for recommender systems and derive several key findings on the use of LLMs in recommender systems. We show that LLMs have promising zero-shot ranking abilities, and are even competitive with or better than conventional recommendation models on candidates retrieved by multiple candidate generators. We also demonstrate that LLMs struggle to perceive the order of historical interactions and can be affected by biases such as position bias, although these issues can be alleviated via specially designed prompting and bootstrapping strategies. The code to reproduce this work is available at https://github.com/RUCAIBox/LLMRank.
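As a rough illustration of the setup described in the abstract, the sketch below assembles a prompt from the three named parts (interaction history, candidate items, ranking instruction) and then reduces sensitivity to candidate order by ranking several shuffled copies of the candidate list and merging the results. It is a minimal sketch under stated assumptions, not the implementation in the LLMRank repository: the prompt wording, the names `build_prompt` and `bootstrap_rank`, the `rank_fn` callable standing in for an LLM call, and the position-based score aggregation are all hypothetical choices made for this example.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for an LLM call: it receives the prompt text and the
# candidates in the order shown in the prompt, and returns them ranked.
RankFn = Callable[[str, List[str]], List[str]]


def build_prompt(history: Sequence[str], candidates: Sequence[str]) -> str:
    """Combine history, candidates, and a ranking instruction into one prompt.

    The wording is an assumption for illustration, not the paper's template.
    """
    history_text = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(history))
    candidate_text = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(candidates))
    return (
        "I have interacted with the following items, in order:\n"
        f"{history_text}\n\n"
        "Candidate items:\n"
        f"{candidate_text}\n\n"
        "Rank the candidate items by how likely I am to interact with them next."
    )


def bootstrap_rank(
    history: Sequence[str],
    candidates: Sequence[str],
    rank_fn: RankFn,
    rounds: int = 3,
    seed: int = 0,
) -> List[str]:
    """Rank shuffled copies of the candidates several times and merge the results.

    Each round awards an item more points the higher it is ranked; summing the
    points across rounds averages out effects of where an item appeared in the
    prompt. Candidates are assumed to be unique strings.
    """
    rng = random.Random(seed)
    scores: Dict[str, int] = {c: 0 for c in candidates}
    for _ in range(rounds):
        shuffled = list(candidates)
        rng.shuffle(shuffled)
        ranked = rank_fn(build_prompt(history, shuffled), shuffled)
        for position, item in enumerate(ranked):
            if item in scores:  # ignore items the ranker made up
                scores[item] += len(candidates) - position
    # sorted() is stable, so tied items keep their original relative order.
    return sorted(candidates, key=lambda c: -scores[c])
```

In practice `rank_fn` would wrap a call to an LLM and parse its answer back into a list of candidate titles; here it is left as a parameter so the aggregation logic can be exercised with any deterministic ranker.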