Do pretrained language models have knowledge of the surface information of tokens? We examined the surface information stored in the word or subword embeddings acquired by pretrained language models from three perspectives: token length, substrings, and token constitution. We also evaluated the models' ability to generate knowledge about token surfaces. We focused on 12 pretrained language models trained mainly on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge of token length and substrings but not of token constitution. The results further imply a bottleneck on the decoder side in effectively utilizing the acquired knowledge.