FairEvalLLM. A Comprehensive Framework for Benchmarking Fairness in Large Language Model Recommender Systems

This paper presents a framework for evaluating fairness in recommender systems powered by Large Language Models (RecLLMs), addressing the need for a unified approach that spans several fairness dimensions, including sensitivity to user attributes, intrinsic fairness, and discussions of fairness based on underlying benefits. In addition, our framework introduces counterfactual evaluations and integrates diverse user group considerations to enhance the discourse on fairness evaluation for RecLLMs. Our key contributions are a robust framework for fairness evaluation in LLM-based recommendations and a structured method for creating \textit{informative user profiles} from demographic data, historical user preferences, and recent interactions. We argue that the latter is essential for enhancing personalization in such systems, especially in temporally driven scenarios. We demonstrate the utility of our framework through practical applications on two datasets, LastFM-1K and ML-1M. We conduct experiments on a subsample of 80 users from each dataset, assessing the effectiveness of more than 50 scenarios that vary prompt construction and in-context learning. This results in more than 4000 recommendations ($80 \times 50 = 4000$). Our study finds no significant unfairness in scenarios that involve sensitive attributes, although some concerns remain. In terms of intrinsic fairness, however, which does not involve direct sensitivity, unfairness across demographic groups remains significant. The code and data used for this paper are available at: \url{https://shorturl.at/awBFM}.
