Scientific news reports serve as a bridge, translating complex research articles into reports that resonate with the broader public. Generating such reports automatically makes scholarly insights more accessible. In this paper, we present a new corpus to support the development of this task. The corpus is a parallel collection of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis that highlights how scientific news reports differ from academic papers in readability and brevity. We benchmark state-of-the-art text generation models on our dataset, using both automatic and human evaluation, which lays the groundwork for future work on the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.
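The readability and brevity comparison mentioned in the abstract can be illustrated with a small sketch. The abstract does not name the metrics used, so the choices here are assumptions: readability is approximated with the Flesch-Kincaid grade level and a rough syllable heuristic, and brevity with a simple word-count ratio. The function names `count_syllables`, `fkgl`, and `length_ratio` are hypothetical and are not taken from the released code.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count groups of consecutive vowels and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid grade level; higher values mean harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def length_ratio(paper: str, news: str) -> float:
    """Word count of the news report divided by that of the paper."""
    paper_len = len(paper.split())
    if paper_len == 0:
        raise ValueError("paper text is empty")
    return len(news.split()) / paper_len


if __name__ == "__main__":
    # Toy texts written for illustration only; they are not from the corpus.
    paper = ("The investigation demonstrates considerable heterogeneity in "
             "mitochondrial transcriptional regulation across experimental populations.")
    news = "The study shows that cells differ. They do so a lot."
    print("paper grade level:", round(fkgl(paper), 2))
    print("news grade level:", round(fkgl(news), 2))
    print("news/paper length ratio:", round(length_ratio(paper, news), 2))
```

On a real paper and news pair, a lower grade level for the news text and a length ratio below 1 would match the direction of the divergence the abstract describes, though the syllable heuristic is too coarse for reporting exact scores.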