While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or is ill-suited to the way the vocabularies were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, which combines measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well, and that SaGe, a recently introduced contextually informed tokenizer, outperforms all others on morphological alignment.
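To make the notion of an inference method concrete, the following is a minimal sketch of one common form of greedy inference, longest-prefix matching, applied to a fixed vocabulary. The function name `greedy_longest_prefix`, the `[UNK]` fallback, and the toy vocabulary are all illustrative assumptions; this is not the evaluation code behind the analysis, and real tokenizers typically add details such as word-continuation markers.

```python
def greedy_longest_prefix(word, vocab, unk="[UNK]"):
    """Segment `word` left to right, always taking the longest vocabulary match.

    Hypothetical helper for illustration only; it ignores continuation
    prefixes (such as "##") that some tokenizers use.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry starts at this position: give up on the word.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


# Toy vocabulary (an assumption, not taken from any trained tokenizer).
toy_vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}

print(greedy_longest_prefix("unbelievable", toy_vocab))
```

The point of the sketch is that this procedure uses only the vocabulary itself: it needs no merge rules or token likelihoods, so it can be applied to a vocabulary regardless of which algorithm built it.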