Retrieval-Augmented Generation (RAG) has recently gained significant attention for its enhanced ability to integrate external knowledge sources into open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, such as improving exact-match accuracy, fairness considerations may have been largely overlooked. Second, the complex, multi-component architecture of RAG methods poses challenges in identifying and mitigating biases, as each component is optimized for distinct objectives. In this paper, we empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG, using scenario-based questions and analyzing disparities across demographic attributes. Our experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages. These findings underscore the need for targeted interventions to address fairness concerns throughout the RAG pipeline. The dataset and code used in this study are publicly available at https://github.com/elviswxy/RAG_fairness.
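The disparity analysis described above can be illustrated with a minimal sketch: compute exact-match (EM) accuracy per demographic group and report the largest pairwise gap. This is a hypothetical illustration, not the paper's released code; the group labels, records, and function names are invented for the example.

```python
# Hypothetical sketch of a group-disparity check for QA exact match (EM).
# All data below is invented; real evaluations would use the paper's dataset.

from collections import defaultdict

def exact_match(pred: str, gold: str) -> int:
    """Case- and whitespace-insensitive exact match."""
    return int(pred.strip().lower() == gold.strip().lower())

def em_by_group(records):
    """records: iterable of (group, prediction, gold_answer) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, pred, gold in records:
        hits[group] += exact_match(pred, gold)
        totals[group] += 1
    return {g: hits[g] / totals[g] for g in totals}

def max_disparity(scores):
    """Largest pairwise gap in per-group EM accuracy."""
    vals = list(scores.values())
    return max(vals) - min(vals)

# Toy records: (demographic group, model prediction, gold answer)
records = [
    ("group_a", "paris", "Paris"),
    ("group_a", "berlin", "Madrid"),
    ("group_b", "tokyo", "Tokyo"),
    ("group_b", "oslo", "Oslo"),
]
scores = em_by_group(records)
print(scores)                 # {'group_a': 0.5, 'group_b': 1.0}
print(max_disparity(scores))  # 0.5
```

A nonzero gap on matched scenario-based questions (identical except for the sensitive attribute) is the kind of signal the proposed framework surfaces in both the retrieval and generation stages.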