Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore a distinctive feature of the Chinese writing system: additional linguistic information exists below the character level, i.e., at the sub-character level. To exploit this information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary from the encoded text using sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) they tokenize inputs into much shorter sequences, which improves computational efficiency; 2) pronunciation-based SubChar tokenizers encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
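The two-step pipeline described in the abstract (encode each character, then apply sub-word segmentation to the encoded text) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the released code: the five-entry `TOY_PINYIN` table, the `#` boundary marker, the toneless transliteration, and the bare-bones byte-pair-style merge loop standing in for whichever sub-word algorithm is actually used.

```python
from collections import Counter

# Hypothetical toy table (character -> toneless pinyin); a real system
# would need a full pronunciation or glyph dictionary.
TOY_PINYIN = {"他": "ta", "她": "ta", "是": "shi", "事": "shi", "我": "wo"}
SEP = "#"  # assumed marker that closes each encoded character


def encode(text):
    """Step 1: replace every known character by its transliteration + SEP."""
    return "".join(TOY_PINYIN[ch] + SEP if ch in TOY_PINYIN else ch for ch in text)


def apply_merge(seq, pair):
    """Merge every non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Step 2: learn merge rules on the encoded corpus (simplified BPE)."""
    seqs = [list(encode(line)) for line in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def tokenize(text, merges):
    """Encode the text, then replay the learned merges in order."""
    seq = list(encode(text))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


corpus = ["他是我", "她是我", "我是他"]
merges = learn_merges(corpus, num_merges=10)
print(tokenize("他是我", merges))
print(tokenize("她事我", merges))  # homophone substitutions of the line above
```

Because the merges are learned and applied on the transliterated text, any two inputs that differ only in homophones become identical strings before segmentation, so they necessarily receive the same tokens. Merged tokens can also span character boundaries, which is one way the encoded text can end up as a shorter token sequence than one token per character.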