Pretrained language models (PLMs) have brought substantial improvements across a wide range of NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and ignore word information entirely. Although Whole Word Masking can alleviate this, word-level semantics are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character-level and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works for Japanese. Our code and model have been released at https://github.com/xnliang98/MigBERT.
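To make the contrast between character-level and mixed-granularity segmentation concrete, the following is a minimal sketch only. It is not the MigBERT tokenizer: the greedy longest-match strategy, the toy vocabulary, and the names `char_tokenize`, `mixed_tokenize`, and `max_word_len` are all assumptions introduced for illustration.

```python
def char_tokenize(text):
    """Character-level segmentation: every character is its own token."""
    return list(text)


def mixed_tokenize(text, word_vocab, max_word_len=4):
    """Illustrative mixed-granularity segmentation (hypothetical helper).

    At each position, emit the longest multi-character word found in
    `word_vocab`; if none matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in word_vocab:
                match = candidate
                break
        if match is None:
            match = text[i]  # character fallback
        tokens.append(match)
        i += len(match)
    return tokens


# Toy vocabulary, assumed for this example only.
toy_vocab = {"预训练", "语言", "模型"}
sentence = "预训练语言模型很强"

print(char_tokenize(sentence))
print(mixed_tokenize(sentence, toy_vocab))
```

With this toy vocabulary, the character-level view yields nine single-character tokens, while the mixed view keeps 预训练, 语言, and 模型 as word tokens and falls back to characters for the rest of the sentence.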