Pretrained language models (PLMs) have brought substantial improvements across a wide range of NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and ignore word information entirely. Although Whole Word Masking can alleviate this, the semantics of words are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character-level and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new state-of-the-art (SOTA) performance on all of these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works for Japanese. Our code is available on GitHub\footnote{\url{https://github.com/xnliang98/MigBERT}} and our model can be downloaded from Hugging Face\footnote{\url{https://huggingface.co/xnliang/MigBERT-large/}}.