In "Tokenization and the Noiseless Channel" (Zouhar et al., 2023a), Rényi efficiency is proposed as an intrinsic metric for evaluating a tokenizer: for NLP tasks, one should choose the tokenizer that yields the highest Rényi efficiency of the unigram token distribution. Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task) that avoids the expensive step of training multiple models with different tokenizers. Although the metric is useful, its predictive power is imperfect, and the authors note that a good tokenization scheme has additional qualities that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization that can arbitrarily increase Rényi efficiency while decreasing downstream model performance. These counterexamples expose cases where Rényi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.
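
For concreteness, the following is a minimal sketch of how the Rényi efficiency of a unigram token distribution might be computed. It assumes the usual definition, the Rényi entropy of order α divided by the logarithm of the vocabulary size. The function name `renyi_efficiency`, the default `alpha=2.5`, and the fallback of using the number of observed token types as the vocabulary size are illustrative assumptions and are not specified in the text above.

```python
import math
from collections import Counter


def renyi_efficiency(tokens, alpha=2.5, vocab_size=None):
    """Rényi efficiency of the unigram distribution of `tokens`.

    Assumed definition: H_alpha(p) / log|V|, where
    H_alpha(p) = log(sum_i p_i ** alpha) / (1 - alpha).
    `alpha=2.5` is an illustrative default, not a value taken from the text.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]

    # Assumption: if no vocabulary size is given, use the observed token types.
    v = vocab_size if vocab_size is not None else len(counts)
    if v < 2:
        raise ValueError("vocabulary size must be at least 2")

    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 recovers Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

    return entropy / math.log(v)


# A uniform unigram distribution attains the maximum efficiency,
# while a heavily skewed one scores lower.
uniform = ["a", "b", "c", "d"] * 25
skewed = ["a"] * 97 + ["b", "c", "d"]
print(renyi_efficiency(uniform), renyi_efficiency(skewed))
```

Under this definition, any tokenization that flattens the unigram distribution raises the score whether or not the resulting tokens are useful to a model, which is the kind of gap the counterexamples exploit.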