Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to \emph{efficient} channel usage, where the channel is the means by which some input is conveyed to the model, and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of R\'enyi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the R\'enyi entropy with $\alpha = 2.5$ has a very strong correlation with \textsc{Bleu}: $0.78$, compared to just $-0.32$ for compressed length.
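As a rough illustration of the efficiency measure described in the abstract, the sketch below computes the R\'enyi entropy of order $\alpha$ of an empirical token distribution, $H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$, and divides it by the maximum possible entropy $\log |V|$. The function name `renyi_efficiency`, the `vocab_size` parameter, and the choice to default the vocabulary size to the number of observed token types are assumptions made for this example; they are not taken from the paper's code.

```python
import math
from collections import Counter


def renyi_efficiency(tokens, alpha=2.5, vocab_size=None):
    """Ratio of the Renyi entropy of the unigram token distribution
    to the maximum possible entropy, log(vocab_size).

    Hypothetical helper for illustration only. `vocab_size` defaults to
    the number of distinct tokens observed (an assumption of this sketch).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if vocab_size is None:
        vocab_size = len(counts)
    if total == 0 or vocab_size < 2:
        raise ValueError("need a non-empty sample and at least two token types")

    probs = [c / total for c in counts.values()]

    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 recovers the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

    # The uniform distribution over the vocabulary attains log(vocab_size).
    return entropy / math.log(vocab_size)


if __name__ == "__main__":
    balanced = ["a", "b", "c", "d"] * 25
    skewed = ["a"] * 97 + ["b", "c", "d"]
    print("balanced:", renyi_efficiency(balanced))
    print("skewed (alpha=2.5):", renyi_efficiency(skewed))
    print("skewed (alpha=1.0):", renyi_efficiency(skewed, alpha=1.0))
```

For a skewed distribution, the value at $\alpha = 2.5$ is lower than the Shannon-based value at $\alpha = 1$, because R\'enyi entropy is non-increasing in $\alpha$ and higher orders weight the most frequent tokens more heavily.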