Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.

翻译：大型语言模型（LLM）已彻底改变了自然语言处理领域，但其仍受限于如字节对编码（BPE）这类固定且不可微的分词器，这阻碍了端到端优化以及对噪声或领域特定数据的适应能力。我们提出了Zonkey，一种分层扩散模型，通过从原始字符到文档级表示的完全可训练流程来解决这些限制。其核心是一个可微分分词器（Segment Splitter），它学习概率性的序列起始（BOS）决策，从而实现自适应分割，这些分割在没有显式监督的情况下会呈现出有语言学意义的结构（例如，空格处的词边界、句号处的句子起始）。这种可微性得益于我们新颖的概率注意力机制，该机制通过引入位置特定的存在概率来模拟理论上无限序列上的软掩码，同时保持梯度。序列以概率方式衰减而非依赖序列结束标记，从而支持可变长度输出。分层结构将序列压缩为更高层次的抽象（例如，从字符n-gram到类词向量，再到类句向量），并通过我们的去噪扩散混合模型（DDMM）在潜在空间中进行稳定高效的去噪重建。一个缝合器（Stitcher）确保跨片段的重叠不变性。在维基百科数据上进行端到端训练后，Zonkey能够从噪声中生成连贯、可变长度的文本，与基于熵的可学习分词器相比，展现出涌现的层次结构以及与数据分布有希望的定性对齐。我们的方法向完全基于梯度的LLM迈进了一步，具有改善领域适应性和可扩展生成的潜力。我们公开了用于训练和复现实验的源代码。