One of the major problems in text simplification is the lack of high-quality data. The sources of simplification datasets are limited to Wikipedia and Newsela, which restricts further development of the field. In this paper, we analyzed the similarity between text summarization and text simplification and exploited summarization data to support simplification. First, we proposed an alignment algorithm to extract sentence pairs from summarization datasets. Then, we designed four attributes to characterize the degree of simplification and proposed a method to filter suitable pairs. We named the resulting pairs Sum4Simp (S4S). Next, we conducted human evaluations to show that S4S is of high quality and compared it with a real simplification dataset. Finally, we conducted experiments showing that S4S can improve the performance of several mainstream simplification models, especially in low-resource scenarios.