Retrieval augmentation of large language models for lay language generation

Recent lay language generation systems have used Transformer models trained on parallel corpora to make health information more accessible. However, the applicability of these models is constrained by the limited size and topical breadth of the available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. Both the abstracts and their corresponding lay language summaries are written by domain experts, which assures the quality of the dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy for increasing accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification: it adds content that is absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the way for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
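
To illustrate why retrieval is a natural fit for background explanation, the sketch below shows one generic way a retrieval step could supply content that is absent from the source abstract. It is not the authors' implementation: the term-overlap scoring is a simple stand-in for a real retriever, and the function names and the `[CONTEXT]` / `[SOURCE]` markers are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Return the k passages with the highest IDF-weighted term overlap with the query.

    Assumption: a toy scorer standing in for whatever retriever a real system uses.
    """
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Document frequency: in how many passages each term appears.
    df = Counter(term for doc in docs for term in doc)
    query_terms = set(tokenize(query))

    def score(doc):
        # Rare terms shared with the query contribute more than common ones.
        return sum(
            doc[t] * math.log((n + 1) / (df[t] + 0.5))
            for t in query_terms
            if t in doc
        )

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [passages[i] for i in ranked[:k]]


def build_generator_input(abstract, passages, k=2):
    """Prepend retrieved passages to the abstract as extra context for a generator.

    The [CONTEXT] and [SOURCE] markers are hypothetical separators.
    """
    retrieved = retrieve(abstract, passages, k)
    context = " ".join(f"[CONTEXT] {p}" for p in retrieved)
    return f"{context} [SOURCE] {abstract}"


if __name__ == "__main__":
    external = [
        "Hypertension is another name for high blood pressure.",
        "The mitochondrion produces energy for the cell.",
        "Statins are drugs that lower cholesterol.",
    ]
    source = "We assessed statins in patients with hypertension."
    print(build_generator_input(source, external))
```

In this sketch the generator would receive definitions of the technical terms alongside the abstract, so a background explanation can draw on retrieved text rather than on content the model must invent.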
