The remarkable achievements of Large Language Models (LLMs) have given rise to a novel recommendation paradigm: Recommendation via LLM (RecLLM). However, LLMs may encode social prejudices, so the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate its fairness with respect to various user-side sensitive attributes. Because the RecLLM paradigm differs from the traditional recommendation paradigm, directly applying the fairness benchmarks of traditional recommendation is problematic. To address this dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). The benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. Using FaiRLLM, we evaluate ChatGPT and find that it still exhibits unfairness with respect to some sensitive attributes when generating recommendations. Our code and dataset are available at https://github.com/jizhi-zhang/FaiRLLM.