Frontier LLMs are increasingly utilised across academia, society and industry. The token is a commonly used unit for comparing models, their inputs and outputs, and for estimating inference pricing. Tokens are generally treated as a stable currency, assumed to be broadly consistent across tokenizers and contexts, thereby enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation through a comprehensive empirical analysis of tokenization, exploring how sequences compress into tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study bring clarity and intuition to tokenization in contemporary LLMs.
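As a minimal sketch of the variation described above (not drawn from the paper's experiments), the snippet below uses the open-source `tiktoken` library to compare how two real OpenAI encodings, `cl100k_base` and `o200k_base`, compress short strings from different text domains; the sample strings and domain labels are illustrative assumptions.

```python
import tiktoken

# Illustrative samples only (assumptions, not the paper's data): the same
# string can compress to very different token counts per tokenizer/domain.
samples = {
    "english": "Frontier LLMs are increasingly utilised across academia.",
    "code":    "def f(x): return {k: v**2 for k, v in x.items()}",
    "chinese": "前沿大语言模型的应用日益广泛。",
}

# cl100k_base and o200k_base are encodings shipped with tiktoken.
for enc_name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(enc_name)
    for domain, text in samples.items():
        n_tokens = len(enc.encode(text))
        print(f"{enc_name:12s} {domain:8s} "
              f"{len(text):3d} chars -> {n_tokens:3d} tokens "
              f"({len(text) / n_tokens:.2f} chars/token)")
```

Running a sketch like this makes the abstract's point concrete: chars-per-token ratios differ both between encodings and between domains, so a token count from one tokenizer is not directly comparable to a count from another.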