Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce OasisSimp, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Of these, no prior sentence simplification dataset exists for Tamil, Pashto, or Thai, and only limited data is available for Sinhala. For each language, trained annotators simplified sentences following detailed guidelines, preserving meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on OasisSimp and observe substantial performance disparities between high-resource and low-resource languages, highlighting the challenges of simplification in multilingual settings. OasisSimp thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.