This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach on real-world online texts. To address the scarcity of data for language simplification, we crawled professionally simplified German texts and synthesized a corpus using GPT-4. We fine-tune large language models with up to 13 billion parameters on this data and evaluate their performance. We apply several evaluation methods and demonstrate the limitations of the rule-based metrics currently in use. Both automatic and manual evaluations show that our models can significantly simplify real-world online texts, indicating the potential of synthetic data for improving text simplification.