Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM) pretraining, to create a token dictionary of a prescribed size. Most evaluations of BPE to date are empirical, and the reasons for its good practical performance are not well understood. In this paper we focus on the optimization problem underlying BPE: finding a pair encoding that achieves optimal compression utility. We show that this problem is APX-complete, indicating that it is unlikely to admit a polynomial-time approximation scheme. This answers, in a stronger form, a question recently raised by Zouhar et al. On the positive side, we show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between $0.333$ and $0.625$. Our results aim to explain the ongoing success of BPE and are, to our knowledge, the first rigorous guarantees on its compression utility that hold for all inputs.
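To make the objects in the abstract concrete, the following is a minimal sketch of greedy BPE and of the quantity it tries to maximize. It is not the paper's formal definition. It assumes that compression utility means the reduction in sequence length after at most `k` merges, that pair frequencies are counted over all adjacent positions (so overlapping occurrences such as those in `aaa` are counted separately), and that ties are broken by first occurrence. The names `merge_pair`, `greedy_bpe`, and `compression_utility` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace occurrences of `pair` with `new_symbol`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # consume both symbols of the merged pair
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, k):
    """Apply up to k greedy merges; return (merge list, final symbol sequence).

    Assumption: frequency is counted over all adjacent positions, and ties
    are broken by first occurrence (the ordering used by Counter.most_common).
    """
    seq = list(text)
    merges = []
    for _ in range(k):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:  # fewer than two symbols remain
            break
        pair, _freq = counts.most_common(1)[0]
        merges.append(pair)
        seq = merge_pair(seq, pair, pair[0] + pair[1])
    return merges, seq


def compression_utility(text, seq):
    """Assumed definition: number of symbols saved relative to the input."""
    return len(text) - len(seq)


if __name__ == "__main__":
    text = "abababab"
    merges, seq = greedy_bpe(text, 2)
    print(merges, seq, compression_utility(text, seq))
```

Under these assumptions, the optimization problem studied in the paper asks for the sequence of `k` merges that maximizes this utility, whereas `greedy_bpe` commits to the locally most frequent pair at each step. The approximation bounds quoted above compare the greedy result with that optimum.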