Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, which in turn impedes progress on health literacy. Most existing work on medical text simplification has focused on monolingual settings, so such evidence is available in only one language (most often English). This work addresses this limitation via multilingual simplification, i.e., directly simplifying complex texts into simplified texts in multiple languages. We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain, covering four languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses. Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.