Recently, large language models (LLMs) such as GPT-4 have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work investigates the capacity of LLMs to act as the ranking model in recommender systems. We first formalize the recommendation problem as a conditional ranking task, treating sequential interaction histories as conditions and the items retrieved by other candidate generation models as candidates. To solve the ranking task with LLMs, we carefully design the prompting template and conduct extensive experiments on two widely used datasets. We show that LLMs have promising zero-shot ranking abilities but (1) struggle to perceive the order of historical interactions, and (2) can be biased by popularity or item positions in the prompts. We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies. Equipped with these insights, zero-shot LLMs can even challenge conventional recommendation models when ranking candidates are retrieved by multiple candidate generators. The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank.
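To make the conditional ranking formulation and the bootstrapping idea concrete, the following is a minimal sketch, not the released implementation. It assumes that bootstrapping means ranking the same candidate set several times under different random orderings and merging the results with a Borda-style count. The prompt wording, the function names `build_prompt` and `bootstrap_rank`, and the stand-in `alphabetical_ranker` (used in place of a real LLM call) are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: a ranker receives the prompt and the candidates as shown,
# and returns those candidates ordered from most to least preferred.
RankFn = Callable[[str, Sequence[str]], List[str]]


def build_prompt(history: Sequence[str], candidates: Sequence[str]) -> str:
    """Format a ranking prompt: the history is the condition, the candidates are ranked.

    The wording is an illustrative assumption, not the template used in the paper.
    """
    history_text = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(history))
    candidate_text = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
    return (
        "I have interacted with the following items, in order:\n"
        f"{history_text}\n\n"
        f"Rank these {len(candidates)} candidate items by how likely I am "
        "to interact with them next:\n"
        f"{candidate_text}"
    )


def bootstrap_rank(
    history: Sequence[str],
    candidates: Sequence[str],
    rank_fn: RankFn,
    rounds: int = 3,
    seed: int = 0,
) -> List[str]:
    """Rank the candidates several times in shuffled order and merge the rankings.

    Shuffling before each call means no item is tied to one prompt position.
    Rankings are merged with a Borda-style count (an assumed aggregation rule).
    """
    rng = random.Random(seed)
    scores: Dict[str, int] = defaultdict(int)
    for _ in range(rounds):
        shuffled = list(candidates)
        rng.shuffle(shuffled)
        ranking = rank_fn(build_prompt(history, shuffled), shuffled)
        for position, item in enumerate(ranking):
            # Higher-ranked items earn more points; omitted items earn none.
            scores[item] += len(ranking) - position
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=lambda item: -scores[item])


def alphabetical_ranker(prompt: str, shown: Sequence[str]) -> List[str]:
    """Stand-in for an LLM call: ranks alphabetically, ignoring prompt position."""
    return sorted(shown)


if __name__ == "__main__":
    history = ["The Matrix", "Inception"]
    candidates = ["Tenet", "Interstellar", "Memento"]
    print(build_prompt(history, candidates))
    print(bootstrap_rank(history, candidates, alphabetical_ranker))
```

In practice `rank_fn` would wrap an LLM request and parse the returned ordering. Parsing and handling of malformed outputs are left out of this sketch.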