The remarkable achievements of Large Language Models (LLMs) have led to the emergence of a novel recommendation paradigm: Recommendation via LLM (RecLLM). However, LLMs may encode social prejudices, so the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate its fairness with respect to various sensitive attributes on the user side. Because the RecLLM paradigm differs from the traditional recommendation paradigm, directly applying the fairness benchmarks of traditional recommendation is problematic. To address this dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). The benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. Using FaiRLLM, we evaluate ChatGPT and find that it still exhibits unfairness with respect to some sensitive attributes when generating recommendations. Our code and dataset can be found at https://github.com/jizhi-zhang/FaiRLLM.