Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible. However, texts often contain complex words that hinder reading comprehension and accessibility. Suggesting simpler alternatives for complex words without compromising meaning would therefore help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system built by fine-tuning the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets (LexMTurk, BenchLS, and NNSEval) show that our model outperforms previous state-of-the-art models such as LSBert and ConLS. Further evaluation on part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Our model also obtains performance gains for Spanish and Portuguese.
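To make the three ingredients named above concrete (a language-specific prefix, control tokens, and masked-language-model candidates), the following is a minimal sketch of how a single input string for a text-to-text model might be assembled. The abstract does not specify the actual format, so everything here is an assumption for illustration only: the prefix wording, the `[T]...[/T]` markers around the complex word, the control-token names `word_length` and `word_rank`, and the `| candidates:` separator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class LSExample:
    """One lexical simplification instance (field names are hypothetical)."""
    language: str               # language code, e.g. "en", "es", "pt"
    sentence: str               # sentence containing the complex word
    complex_word: str           # the word to be simplified
    candidates: Sequence[str]   # substitutes, e.g. taken from a masked LM
    controls: Dict[str, float]  # control-token name -> target value


def build_input(ex: LSExample) -> str:
    """Compose a single source string from prefix, controls, and candidates.

    The layout is an illustrative assumption, not the format used by mTLS.
    """
    # Language-specific prefix (hypothetical wording).
    prefix = f"simplify {ex.language}:"
    # Control tokens rendered as <name_value>, sorted for a stable order.
    control_str = " ".join(
        f"<{name}_{value:.2f}>" for name, value in sorted(ex.controls.items())
    )
    # Mark the first occurrence of the complex word in the sentence.
    marked = ex.sentence.replace(
        ex.complex_word, f"[T]{ex.complex_word}[/T]", 1
    )
    # Append the candidate substitutes after a hypothetical separator.
    cand_str = " ".join(ex.candidates)
    return f"{prefix} {control_str} {marked} | candidates: {cand_str}"


example = LSExample(
    language="en",
    sentence="The cat perched on the mat.",
    complex_word="perched",
    candidates=["sat", "rested"],
    controls={"word_length": 0.5, "word_rank": 0.8},
)
text = build_input(example)
print(text)
```

In a setup like this, the resulting string would be tokenized and passed to the sequence-to-sequence model, with the simpler substitute as the target. Changing the control values at inference time is what would make the output controllable.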