Continuous Language Diffusion as a Decoder-Interface Problem

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: our evidence suggests that denoising becomes reliable when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol covering denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not on latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin decoder basin. Once a trajectory is inside the basin, token realization is surprisingly simple on generated ELF states: frozen T5 (Text-to-Text Transfer Transformer) token-embedding lookup recovers $93$--$96\%$ of native decoder decisions, and a single linear readout reaches $97.9\%$ agreement at 32k samples, leaving a perplexity gap of $\approx 1.1$--$1.2$ in a structured residual tail. With conservative held-out gates and an explicit diagnostic monitor, a margin rule exits roughly $17$--$28\%$ earlier, measured in denoising steps. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.
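To make the embedding-lookup readout and the margin rule concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It assumes dot-product scoring against a toy identity embedding table (a stand-in for the frozen T5 table), a deterministic interpolation from a constant "noisy" state to clean embeddings (a stand-in for a real denoising trajectory), and a hypothetical margin threshold. The names `lookup_logits`, `token_margins`, and `margin_exit_step` are illustrative only.

```python
import numpy as np


def lookup_logits(states, embedding):
    """Score each latent state against every token embedding (dot product)."""
    return states @ embedding.T  # shape: (seq_len, vocab_size)


def token_margins(logits):
    """Gap between the largest and second-largest logit at each position."""
    top2 = np.partition(logits, -2, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]


def margin_exit_step(trajectory, embedding, threshold):
    """First step whose smallest per-token margin reaches the threshold.

    Falls back to the final step if the threshold is never reached.
    """
    for step, states in enumerate(trajectory):
        margins = token_margins(lookup_logits(states, embedding))
        if margins.min() >= threshold:
            return step
    return len(trajectory) - 1


# Toy setup (all values are assumptions chosen for illustration).
embedding = np.eye(4)                    # 4 tokens, 4-dimensional embeddings
target = np.array([2, 0, 3])             # token ids of the "clean" sentence
clean = embedding[target]                # clean latent states
noisy = np.full_like(clean, 0.5)         # uninformative starting state
ts = [0.0, 0.25, 0.5, 0.75, 1.0]         # interpolation schedule
trajectory = [(1 - t) * noisy + t * clean for t in ts]

exit_step = margin_exit_step(trajectory, embedding, threshold=0.4)
decoded = lookup_logits(trajectory[exit_step], embedding).argmax(axis=-1)
print(exit_step, decoded.tolist())
```

In this toy schedule the per-token margin grows linearly with the interpolation weight, so the rule exits before the final step while the lookup already returns the target tokens. A real trajectory need not be monotone, which is why the threshold would have to be calibrated on held-out data.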
