Intelligent Multi-Document Summarisation for Extracting Insights on Racial Inequalities from Maternity Incident Investigation Reports

In healthcare, thousands of safety incidents occur every year, but the lessons learned from them are rarely aggregated effectively. Analysing incident reports using AI could uncover critical insights to prevent harm by identifying recurring patterns and contributing factors. Natural language processing (NLP) and machine learning techniques can summarise and mine this unstructured data, potentially surfacing systemic issues and priority areas for improvement. This paper presents I-SIRch:CS, a framework designed to facilitate the aggregation and analysis of safety incident reports while ensuring traceability throughout the process. The framework integrates concept annotation using the Safety Intelligence Research (SIRch) taxonomy with clustering, summarisation, and analysis capabilities. Using a dataset of 188 anonymised maternity investigation reports annotated with 27 SIRch human factors concepts, I-SIRch:CS groups the annotated sentences into clusters using sentence embeddings and k-means clustering, maintaining traceability via file and sentence IDs. A summary is generated for each cluster using state-of-the-art abstractive summarisation models run offline (BART, DistilBART, T5), which are evaluated and compared using metrics that assess summary quality attributes. Each generated summary is linked back to the original file and sentence IDs, so the summarised information can be verified against its sources. Results demonstrate BART's strengths in creating informative and concise summaries.
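
The clustering-with-traceability step described above can be illustrated with a minimal sketch. It is not the I-SIRch:CS implementation: the records, file IDs, sentence texts, and two-dimensional "embeddings" below are all invented for illustration, and a small hand-written k-means stands in for whatever embedding model and clustering library the framework actually uses. The point is only that each clustered sentence keeps its (file ID, sentence ID) pair, so anything derived from a cluster can be traced back to its sources.

```python
import random
from collections import defaultdict

# Hypothetical annotated sentences. The 2-D "embedding" values are made-up
# stand-ins for real sentence embeddings, chosen so that two groups are obvious.
records = [
    {"file_id": "report_001", "sentence_id": 4, "text": "Handover notes were incomplete.", "embedding": (0.0, 0.1)},
    {"file_id": "report_002", "sentence_id": 9, "text": "The handover omitted key details.", "embedding": (0.2, 0.0)},
    {"file_id": "report_003", "sentence_id": 2, "text": "Information was lost at shift change.", "embedding": (0.1, 0.2)},
    {"file_id": "report_001", "sentence_id": 12, "text": "No interpreter was available.", "embedding": (5.0, 5.1)},
    {"file_id": "report_002", "sentence_id": 3, "text": "Interpreting services were not offered.", "embedding": (5.2, 4.9)},
    {"file_id": "report_004", "sentence_id": 7, "text": "A language barrier delayed consent.", "embedding": (4.9, 5.0)},
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_with_trace(records, k):
    """Group records by cluster label, keeping (file_id, sentence_id) per sentence."""
    labels = kmeans([r["embedding"] for r in records], k)
    clusters = defaultdict(list)
    for record, label in zip(records, labels):
        clusters[label].append(
            {"source": (record["file_id"], record["sentence_id"]), "text": record["text"]}
        )
    return dict(clusters)


clusters = cluster_with_trace(records, k=2)
for label, members in sorted(clusters.items()):
    print(f"Cluster {label}")
    for member in members:
        print(f"  {member['source']}: {member['text']}")
```

In a fuller pipeline, the texts in each cluster would be passed to an abstractive summariser, and the list of `source` pairs would be stored alongside the resulting summary so a reader can check it against the original sentences.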
