The remarkable achievements of Large Language Models (LLMs) have led to the emergence of a novel recommendation paradigm: Recommendation via LLM (RecLLM). However, LLMs may contain social prejudices, so the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate its fairness with respect to various sensitive attributes on the user side. Because the RecLLM paradigm differs from the traditional recommendation paradigm, directly applying the fairness benchmarks of traditional recommendation is problematic. To address this dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. Using the FaiRLLM benchmark, we evaluated ChatGPT and found that it still exhibits unfairness with respect to some sensitive attributes when generating recommendations. Our code and dataset can be found at https://github.com/jizhi-zhang/FaiRLLM.