Recent advances in generative AI have been largely driven by large language models (LLMs), deep neural networks that operate over discrete units called tokens. To represent text, the vast majority of LLMs use words or word fragments as tokens, an approach known as subword tokenization. Subword tokenization obscures fine-grained information, which is especially problematic for scientific data, such as computer code or biological sequences, where meaning depends on individual characters. Models that instead operate directly on the byte encoding of text avoid these limitations, but until now they have lagged behind subword-based models in performance. Here we introduce Bolmo, a family of fully open byte-level LLMs that approach the capabilities of subword-based systems. Using a two-stage conversion procedure, we transform existing subword-based models into byte-level models with minimal additional training. The resulting models outperform prior byte-level approaches and excel on character-level reasoning tasks, while remaining competitive across standard benchmarks. By efficiently processing byte-level information, these models achieve practical inference speeds and can be adapted at low cost using the existing ecosystem around the source LLM. Our results remove a long-standing performance barrier to end-to-end byte-level language modeling, demonstrating that models operating on raw text encodings can scale competitively while offering advantages in domains requiring fine-grained textual understanding.
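The contrast between subword and byte-level views of text can be sketched with a toy example. The greedy tokenizer and its vocabulary below are hypothetical illustrations, not the paper's method or any real tokenizer: they merely show that subword pieces hide individual characters inside opaque units, while the byte encoding keeps one unit per character.

```python
# Toy contrast between subword tokenization and byte-level encoding.
# The tokenizer and vocabulary here are hypothetical, for illustration only.

def toy_subword_tokenize(text, vocab):
    """Greedy longest-match tokenization over a toy subword vocabulary.
    Falls back to single characters when no vocabulary piece matches."""
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens

vocab = {"straw", "berry"}  # hypothetical subword vocabulary
word = "strawberry"

subword_tokens = toy_subword_tokenize(word, vocab)
byte_tokens = list(word.encode("utf-8"))

# Subword view: two opaque pieces, individual letters are hidden.
print(subword_tokens)                 # ['straw', 'berry']
# Byte-level view: one integer per character (for ASCII text).
print(byte_tokens)                    # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]
# Character-level questions become trivial at the byte level.
print(byte_tokens.count(ord("r")))    # 3
```

A model consuming `subword_tokens` must memorize the spelling of each piece to answer character-level questions, whereas a model consuming `byte_tokens` observes each character directly.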