While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method by which the vocabularies were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
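The greedy inference strategy discussed above can be illustrated with a minimal longest-prefix-match sketch. This is a generic illustration, not the paper's exact procedure: the toy vocabulary and the single-character fallback are assumptions for demonstration, and real tokenizers such as WordPiece additionally use continuation markers and an UNK token.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match-first inference: at each position, emit the
    longest vocabulary entry that prefixes the remaining text."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Toy fallback: keep the raw character (real tokenizers
            # would emit an UNK token or apply byte-level fallback).
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"un", "break", "able", "b", "r", "e", "a", "k"}
print(greedy_tokenize("unbreakable", vocab))  # ['un', 'break', 'able']
```

Note that greedy inference can differ from the segmentation an algorithm like BPE would produce by replaying its learned merges, which is precisely the mismatch between vocabulary construction and inference that the analysis above examines.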