Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M-parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, whereas UTF-8 validity does not converge until 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Our experiments indicate that reliable UTF-8 generation is a distinct capability that requires evaluation beyond perplexity.
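To make the notion of UTF-8 structural validity concrete, the following is a minimal sketch of how a validity rate over generated byte sequences could be measured. It is not the evaluation protocol used in this work: the function names `is_valid_utf8` and `validity_rate` are hypothetical, and the sketch assumes that each model output is available as a raw `bytes` object and that a sample counts as valid only if the whole sequence decodes under Python's strict UTF-8 decoder.

```python
from typing import Iterable


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence.

    Python's strict decoder rejects truncated multi-byte sequences,
    stray continuation bytes, overlong encodings, and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples: Iterable[bytes]) -> float:
    """Fraction of samples that are structurally valid UTF-8.

    Hypothetical metric for illustration; returns 0.0 for an empty input.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Illustrative inputs (not data from the paper): one ASCII string,
# one complete Japanese string, and one truncated 3-byte sequence.
example_samples = [
    b"abc",
    "日本".encode("utf-8"),
    b"\xe6\x97",  # first two bytes of a 3-byte character, so invalid
]
example_rate = validity_rate(example_samples)
```

A metric of this kind depends only on byte structure and not on whether the decoded text is plausible, which is what allows it to be tracked separately from perplexity.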