Fair Representation in Parliamentary Summaries: Measuring and Mitigating Inclusion Bias

Extended journal version of "Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation" (arXiv:2507.14221), which appeared at the AIDEM Workshop, ECML-PKDD 2025. This version extends the original with cross-lingual bias analysis, a two-level hierarchical summarisation method, and human annotation validation of the evaluation framework.

The use of large language models (LLMs) to summarise parliamentary proceedings presents a promising means of increasing the accessibility of democratic participation. However, as these systems increasingly mediate access to political information -- filtering and framing content before it reaches users -- there are important fairness considerations to address. In this work, we evaluate five LLMs (both proprietary and open-weight) in the summarisation of plenary debates from the European Parliament to investigate the representational biases that emerge in this context. We develop an attribution-aware evaluation framework to measure speaker-level inclusion and misrepresentation in debate summaries. Across all models and experiments, we find that speakers are less accurately represented in the final summary on the basis of (i) their speaking order (speeches in the middle of the debate were systematically excluded), (ii) the language spoken (non-English speakers were less faithfully represented), and (iii) their political affiliation (better outcomes for left-of-centre parties). We further show how biases in these contexts can be decomposed to distinguish inclusion bias (systematic omission) from hallucination bias (systematic misrepresentation), and explore the effect of different mitigation strategies. Prompting strategies do not affect these biases. Instead, we propose a hierarchical summarisation method that decomposes the task into simpler extraction and aggregation steps, which we show significantly reduces positional (speaking-order) bias across all models. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.
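To make the distinction between inclusion bias and hallucination bias concrete, the following is a minimal sketch of how speaker-level outcomes might be aggregated by speaking position. It is not the evaluation framework from the paper: the `SpeakerOutcome` record, the tercile buckets, and the two rate definitions are assumptions chosen for illustration, and the toy data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpeakerOutcome:
    """Hypothetical per-speaker annotation for one debate summary."""
    position: float        # relative speaking order in [0, 1]
    included: bool         # the speaker's contribution appears in the summary
    misrepresented: bool   # included, but the attributed content is unfaithful


def position_bucket(position: float) -> str:
    """Map relative speaking order to an illustrative tercile label."""
    if position < 1 / 3:
        return "start"
    if position < 2 / 3:
        return "middle"
    return "end"


def rates_by_bucket(outcomes):
    """Return {bucket: (inclusion_rate, misrepresentation_rate)}.

    inclusion_rate         = included speakers / all speakers in the bucket
    misrepresentation_rate = misrepresented speakers / included speakers
    """
    total = defaultdict(int)
    included = defaultdict(int)
    misrepresented = defaultdict(int)
    for outcome in outcomes:
        bucket = position_bucket(outcome.position)
        total[bucket] += 1
        if outcome.included:
            included[bucket] += 1
            if outcome.misrepresented:
                misrepresented[bucket] += 1
    return {
        bucket: (
            included[bucket] / total[bucket],
            misrepresented[bucket] / included[bucket] if included[bucket] else 0.0,
        )
        for bucket in total
    }


# Toy, made-up data purely to exercise the function (not results from the paper).
toy = [
    SpeakerOutcome(0.0, True, False),
    SpeakerOutcome(0.2, True, False),
    SpeakerOutcome(0.4, False, False),
    SpeakerOutcome(0.5, True, True),
    SpeakerOutcome(0.8, True, False),
    SpeakerOutcome(1.0, True, False),
]
rates = rates_by_bucket(toy)
```

Keeping the two rates separate means a group can be flagged for systematic omission (low inclusion rate) independently of systematic misrepresentation (high misrepresentation rate among those included). The same grouping could be applied to language or political affiliation by swapping the bucketing function.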
