Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have achieved substantial accuracy improvements by grounding their responses in external documents relevant to the user's query. However, relatively little work has investigated the fairness implications of RAG. In particular, it is not yet known whether queries associated with certain groups within a fairness category systematically receive higher accuracy, or larger accuracy improvements, in RAG systems compared to an LLM-only setting, a phenomenon we refer to as query group fairness. In this work, we conduct extensive experiments to investigate the impact of three key factors on query group fairness in RAG: group exposure, i.e., the proportion of documents from each group appearing in the retrieved set, determined by the retriever; group utility, i.e., the degree to which documents from each group contribute to improving answer accuracy, capturing retriever-generator interactions; and group attribution, i.e., the extent to which the generator relies on documents from each group when producing responses. We examine group-level disparities in average accuracy and in accuracy improvements across four fairness categories, using three datasets derived from the TREC 2022 Fair Ranking Track for two tasks: article generation and title generation. Our findings show that RAG systems suffer from the query group fairness problem and, compared to an LLM-only setting, amplify disparities in average accuracy across queries from different groups. Moreover, a group's utility, exposure, and attribution can exhibit strong positive or negative correlations with the average accuracy or accuracy improvements of queries from that group, highlighting their important role in fair RAG. Our data and code are publicly available on GitHub.
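Of the three factors above, group exposure has the most direct operationalization: the fraction of the retrieved set belonging to each group. A minimal sketch, assuming documents carry a single group label (the function name and example labels are illustrative, not from the paper):

```python
from collections import Counter

def group_exposure(retrieved_groups):
    """Proportion of retrieved documents belonging to each group.

    retrieved_groups: one group label per document in the retrieved set,
    e.g. ["A", "B", "A", "A"] for a top-4 retrieval.
    Returns a dict mapping each group label to its share of the set.
    """
    counts = Counter(retrieved_groups)
    total = len(retrieved_groups)
    return {group: count / total for group, count in counts.items()}

# Example: group A fills 3 of the 4 retrieved slots.
exposure = group_exposure(["A", "B", "A", "A"])
# exposure == {"A": 0.75, "B": 0.25}
```

Group utility and attribution are not pure retrieval statistics; they additionally require accuracy measurements and generator-side signals, so they cannot be reduced to a counting function like this one.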