Subword tokenization has become the de facto standard for tokenization, yet comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus either on the effect of a tokenization algorithm on downstream task performance or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze how well tokenizer output correlates with human response times and accuracy in a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Contrary to prior work, our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes.
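As a rough illustration of the correlation analysis described in the abstract, the sketch below relates the number of subword pieces per word to a response time. Everything in it is an assumption made for illustration: the abstract does not say which property of the tokenizer output is measured, so the split count is a stand-in; `toy_tokenize` is a hypothetical greedy segmenter rather than any of the three evaluated algorithms; and the vocabulary and response times are invented toy values, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def toy_tokenize(word, vocab):
    """Hypothetical greedy longest-match segmenter (stand-in for a real tokenizer).

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


# Invented toy vocabulary and response times (milliseconds); NOT study data.
vocab = {"un", "happi", "ness", "walk", "ing", "cat"}
toy_response_ms = {"cat": 520.0, "walking": 580.0, "unhappiness": 690.0}

words = list(toy_response_ms)
splits = [len(toy_tokenize(w, vocab)) for w in words]
times = [toy_response_ms[w] for w in words]

r = pearson(splits, times)
print(f"splits={splits}, r={r:.3f}")
```

With real data, `toy_tokenize` would be replaced by the tokenizer under evaluation, and the response times would come from a lexical decision dataset. The same function could be applied to accuracy instead of response time.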