Recent lay language generation systems have used Transformer models trained on parallel corpora to make health information more accessible. However, the applicability of these models is constrained by the limited size and topical breadth of the available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. Both the abstracts and their corresponding lay language summaries are written by domain experts, which ensures the quality of the dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation to be a key strategy for increasing accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification: it adds content that is absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the way for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.