People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide comprehensive coverage of these diverse perspectives without underrepresenting any group. However, existing work on summarization metrics and on the evaluation of Large Language Models (LLMs) has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting the perspectives of any group of people, and we propose four reference-free automatic metrics that measure the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.