Cross-lingual science journalism produces popular science stories for a non-expert audience in a language different from that of the source scientific articles. Hence, a cross-lingual popular summary must contain the salient content of the input document, and that content should be coherent, comprehensible, and written in the local language of the target audience. We improve these aspects of cross-lingual summary generation by jointly training two high-level NLP tasks: simplification and cross-lingual summarization. The former reduces linguistic complexity, and the latter focuses on cross-lingual abstractive summarization. We propose a novel multi-task architecture, SimCSum, consisting of one shared encoder and two parallel decoders that jointly learn simplification and cross-lingual summarization. We empirically investigate the performance of SimCSum by comparing it with several strong baselines across several evaluation metrics and through human evaluation. Overall, SimCSum demonstrates statistically significant improvements over the state of the art on two non-synthetic cross-lingual scientific datasets. Furthermore, we conduct an in-depth investigation into the linguistic properties of the generated summaries and an error analysis.
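The shared-encoder, two-decoder layout described above can be illustrated with a minimal toy sketch. This is not the SimCSum implementation: the class `ToySharedEncoderTwoDecoders`, the single-layer linear "decoders", the `joint_loss` function, and the weighted-sum loss with its `weight` parameter are all assumptions made for illustration, since the abstract does not specify how the two task losses are combined. The sketch only shows the structural idea that one encoding is computed once and consumed by two task-specific heads whose losses are optimized together.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToySharedEncoderTwoDecoders:
    """Hypothetical toy model: one shared encoder feeding two parallel heads.

    Real decoders would be autoregressive sequence models; a single linear
    layer per task stands in for each of them here.
    """

    def __init__(self, in_dim, hid_dim, vocab_simp, vocab_sum, seed=0):
        rng = random.Random(seed)
        self.encoder = make_matrix(hid_dim, in_dim, rng)          # shared
        self.simp_decoder = make_matrix(vocab_simp, hid_dim, rng)  # simplification head
        self.sum_decoder = make_matrix(vocab_sum, hid_dim, rng)    # cross-lingual summary head

    def encode(self, features):
        return [math.tanh(h) for h in matvec(self.encoder, features)]

    def forward(self, features):
        hidden = self.encode(features)  # computed once, used by both heads
        p_simp = softmax(matvec(self.simp_decoder, hidden))
        p_sum = softmax(matvec(self.sum_decoder, hidden))
        return p_simp, p_sum


def joint_loss(p_simp, p_sum, target_simp, target_sum, weight=0.5):
    """Assumed joint objective: weighted sum of the two cross-entropy losses."""
    loss_simp = -math.log(p_simp[target_simp])
    loss_sum = -math.log(p_sum[target_sum])
    return weight * loss_simp + (1.0 - weight) * loss_sum


# Usage: one input vector, one target token per task.
model = ToySharedEncoderTwoDecoders(in_dim=4, hid_dim=3, vocab_simp=5, vocab_sum=6)
p_simp, p_sum = model.forward([1.0, 0.0, -1.0, 0.5])
loss = joint_loss(p_simp, p_sum, target_simp=2, target_sum=3)
```

Because both heads read the same hidden representation, a gradient step on the combined loss would update the shared encoder with signal from both tasks, which is the usual motivation for this kind of multi-task setup.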