Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste model capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework in which tokenizer and model are co-designed, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.