Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches have narrowed this gap. In this work, we narrow it further by modeling text as a continuous diffusion process over fixed-width binary bitstreams. We refer to the resulting model as CoBit (Continuous Bitstream Diffusion). Our approach represents semantic tokens as analog bit sequences and uses a matched-filter residual parameterization to separate contextual learning from the analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On LM1B, our 130M-parameter model reaches a generative perplexity (GenPPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our sampler establishes a new continuous-DLM Pareto frontier, achieving GenPPL 27.06 at entropy 5.26 using 4x fewer steps than previous 1024-NFE baselines. Scaling the same recipe to a 462M-parameter model (CoBit-M) further improves the OWT GenPPL-entropy frontier over both the 130M model (CoBit-S) and medium-scale continuous and discrete DLM baselines. CoBit-M reaches GenPPL 19.5 at entropy 5.40, near the real-data entropy (5.44), and approaches pretrained GPT-2 Medium over the high-quality region. As an additional benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck of standard DLMs: by predicting O(log V) bitwise logits via semantic bit-patching, it lowers memory use and raises throughput, so the approach remains scalable as vocabulary sizes grow.
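To make the O(log V) bitstream representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain fixed-width binary encoding of token IDs mapped to analog values in {-1, +1}; the paper's semantic bit-patching assigns codes in a way the abstract does not specify. It also assumes a Gaussian corruption x_t = alpha * b + sigma * eps with a uniform prior on each bit, under which the independent-bit posterior mean is tanh(alpha * x_t / sigma^2). All function names are hypothetical.

```python
import numpy as np


def num_bits(vocab_size: int) -> int:
    """Bits needed to index a vocabulary of this size: ceil(log2 V), at least 1."""
    return max(1, (vocab_size - 1).bit_length())


def tokens_to_analog_bits(token_ids, n_bits: int) -> np.ndarray:
    """Map integer token IDs to analog bits in {-1.0, +1.0}, most significant bit first.

    Assumption: plain binary codes of the token ID, not the paper's semantic codes.
    """
    ids = np.asarray(token_ids, dtype=np.int64)
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (ids[..., None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0


def analog_bits_to_tokens(x) -> np.ndarray:
    """Threshold analog values at zero and reassemble the integer token IDs."""
    bits = (np.asarray(x) > 0).astype(np.int64)
    n_bits = bits.shape[-1]
    weights = 2 ** np.arange(n_bits - 1, -1, -1)
    return (bits * weights).sum(axis=-1)


def independent_bit_posterior_mean(x_t, alpha: float, sigma: float) -> np.ndarray:
    """E[b | x_t] for b uniform on {-1, +1} and x_t = alpha * b + sigma * eps.

    This is the context-free, per-bit estimate; a network would only need to
    model what context adds on top of it (assumed reading of the abstract).
    """
    return np.tanh(alpha * np.asarray(x_t) / sigma**2)


if __name__ == "__main__":
    vocab_size = 50257  # GPT-2 vocabulary size, used here only as an example
    n = num_bits(vocab_size)
    x0 = tokens_to_analog_bits([0, 1, 50256], n)
    rng = np.random.default_rng(0)
    alpha, sigma = 0.8, 0.6  # hypothetical noise-schedule values
    x_t = alpha * x0 + sigma * rng.standard_normal(x0.shape)
    print(n, x0.shape)
    print(independent_bit_posterior_mean(x_t, alpha, sigma).round(2)[0])
```

With this encoding, the output layer predicts one logit per bit, so its width is `num_bits(V)` rather than `V`.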