Lexical simplification (LS) is the task of automatically replacing complex words with simpler ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, and second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT was compiled following the ALEXSIS protocol for Spanish, opening new avenues for cross-lingual models. It is the first multi-candidate LS dataset to contain Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset: mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
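To make the notion of a multi-candidate dataset concrete, the following minimal sketch shows how one instance (a sentence, its complex word, and the gold candidate substitutions) might be represented, and how a ranked list of generated substitutes could be scored against it. Everything here is an assumption made for illustration: the `LSInstance` class, the `potential_at_k` function, the example sentences, and the annotator counts are hypothetical and are not taken from ALEXSIS-PT, and the abstract does not state which evaluation metrics were used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LSInstance:
    """One hypothetical multi-candidate LS instance (layout is an assumption)."""
    sentence: str
    complex_word: str
    gold: Dict[str, int]  # candidate substitution -> number of annotators proposing it


def potential_at_k(instances: List[LSInstance],
                   predictions: List[List[str]],
                   k: int) -> float:
    """Fraction of instances where at least one of the top-k predicted
    substitutes appears among the gold candidates (illustrative metric)."""
    if not instances:
        return 0.0
    hits = 0
    for inst, preds in zip(instances, predictions):
        # Gold keys are stored in lowercase, so predictions are lowercased too.
        if any(p.lower() in inst.gold for p in preds[:k]):
            hits += 1
    return hits / len(instances)


# Invented examples, not drawn from the dataset.
instances = [
    LSInstance(
        sentence="O governo vai promulgar a nova lei amanhã.",
        complex_word="promulgar",
        gold={"publicar": 5, "aprovar": 4, "anunciar": 3},
    ),
    LSInstance(
        sentence="A medida foi considerada inócua pelos especialistas.",
        complex_word="inócua",
        gold={"inofensiva": 6, "inútil": 2},
    ),
]

# Hypothetical ranked outputs of a substitute generation model.
predictions = [
    ["decretar", "aprovar", "publicar"],
    ["ineficaz", "nula", "fraca"],
]

print(potential_at_k(instances, predictions, k=1))
print(potential_at_k(instances, predictions, k=3))
```

Storing annotator counts alongside each candidate is one possible design choice: it would allow a metric to weight frequently proposed substitutions more heavily, whereas the sketch above only checks set membership.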