Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods on the Divsumm summarization dataset, which contains tweets in White-aligned, Hispanic, and African-American dialects, and compare them against relevant baselines. Results obtained using a comprehensive set of summarization quality metrics, including SUPERT, BLANC, SummaQA, BARTScore, and UniEval, together with a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research on fairness-aware NLP models.
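To make the composite quality+fairness evaluation concrete, the following is a minimal illustrative sketch in Python. It assumes the composite score (e.g., SUPERT+F) is modeled as the mean of a min-max-normalized quality score and a fairness score F in [0, 1]; this simple averaging is an assumption for illustration and may differ from the exact formulation defined in the paper.

```python
# Illustrative sketch of a composite quality+fairness score.
# ASSUMPTION: the composite metric is modeled as the mean of a
# min-max-normalized quality score and a fairness score F in [0, 1];
# the paper's exact formulation may differ.

def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw quality score into [0, 1]."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

def composite_score(quality: float, fairness: float,
                    q_min: float = 0.0, q_max: float = 1.0) -> float:
    """Combine a summarization quality score (e.g., SUPERT) with a
    fairness score F into a single value, e.g., SUPERT+F."""
    q = normalize(quality, q_min, q_max)
    return 0.5 * (q + fairness)

# Example with hypothetical values: SUPERT = 0.62, F = 0.90
print(composite_score(0.62, 0.90))  # -> 0.76
```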