This work takes a critical stance on previous studies of fairness evaluation in Large Language Model (LLM)-based recommender systems, which have primarily assessed consumer fairness by comparing recommendation lists generated with and without sensitive user attributes. Such approaches implicitly treat any discrepancy in recommended items as bias, overlooking whether these changes might stem from genuine personalization aligned with users' true preferences. Moreover, these earlier studies typically address single sensitive attributes in isolation, neglecting the complex interplay of intersectional identities. In response to these shortcomings, we introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness by considering overlapping sensitive attributes. Additionally, CFaiRLLM introduces diverse user profile sampling strategies (random, top-rated, and recency-focused) to better understand how the profiles fed to LLMs affect outcomes in light of the inherent token limitations of these systems. Given that fairness depends on accurately understanding users' tastes and preferences, these strategies provide a more realistic assessment of fairness within RecLLMs. The results demonstrate that true preference alignment offers a more personalized and fair assessment than similarity-based measures, revealing significant disparities when sensitive and intersectional attributes are incorporated. Notably, our study finds that intersectional attributes amplify fairness gaps more prominently, especially in less structured domains such as music recommendation on LastFM.
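The three profile sampling strategies can be sketched as follows. This is a minimal illustration, not CFaiRLLM's actual interface: the function name, the `(item, rating, timestamp)` tuple layout, and the item budget `k` (standing in for the token limit) are assumptions made for the example.

```python
import random

def sample_profile(interactions, k, strategy="random", seed=0):
    """Build a length-k user profile from (item, rating, timestamp) tuples.

    Strategies mirror the three described above:
      - "random": k interactions chosen uniformly at random
      - "top":    the k highest-rated items
      - "recent": the k most recently interacted-with items
    """
    if strategy == "random":
        rng = random.Random(seed)  # fixed seed for reproducible sampling
        chosen = rng.sample(interactions, min(k, len(interactions)))
    elif strategy == "top":
        chosen = sorted(interactions, key=lambda x: x[1], reverse=True)[:k]
    elif strategy == "recent":
        chosen = sorted(interactions, key=lambda x: x[2], reverse=True)[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [item for item, _, _ in chosen]
```

The returned item list would then be serialized into the LLM prompt; capping at `k` items keeps the profile within the model's context budget regardless of strategy.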