We introduce CDBERT, a new learning paradigm that enhances the semantic understanding of Chinese pre-trained language models (PLMs) with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi: Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries, and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks: Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. In addition, we propose a new polysemy discrimination task, PolyMRC, based on a collected dictionary of ancient Chinese. Our paradigm yields consistent improvements over previous Chinese PLMs across all tasks, and significant gains in the few-shot setting of ancient Chinese understanding.