A model's ability to understand synonymous expressions is crucial in many downstream tasks. It helps the model better capture the similarity between contexts and makes it more robust to synonym substitution attacks. However, many pretrained language models (PLMs) lack synonym knowledge, owing to the limitations of small-scale synsets and of PLMs' pretraining objectives. In this paper, we propose a framework called Sem4SAP that mines synsets from an Open Knowledge Graph (Open-KG) and uses the mined synsets for synonym-aware pretraining of language models. We propose to coarsely filter the content in the Open-KG and to use frequency information to aid the clustering process under low-resource, unsupervised conditions. We expand the mined synsets by migrating core semantics between synonymous expressions. We also propose two novel and effective synonym-aware pretraining methods for injecting synonym knowledge into PLMs. Extensive experiments demonstrate that Sem4SAP dramatically outperforms the original PLMs and other baselines on ten different tasks.