Diffusion Language Models (DLMs) offer attractive advantages over autoregressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a train-inference mismatch: DLMs are trained with a static, single-step masked-prediction objective but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via bi-level optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM converges faster and reaches a lower training loss. The inner loop can also be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. When activated this way, the Parametric Memory acts as an emergent in-weight retrieval mechanism that further reduces token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
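To make the inner/outer structure described in the abstract concrete, the following is a minimal toy sketch of bi-level training with per-sample fast weights. It is not the MemDLM implementation. The scalar model (a masked token is predicted as a weight times the mean of the visible tokens), the mask schedule, the learning rates, the first-order outer update (no gradient is propagated through the inner loop), and all function names are assumptions made purely for illustration.

```python
import random


def masked_loss_and_grad(weight, tokens, mask):
    """Return (loss, dloss/dweight) for a toy masked-prediction model.

    Toy assumption: every masked token is predicted as
    weight * (mean of the visible tokens); the loss is the mean
    squared error over the masked positions.
    """
    visible = [t for t, m in zip(tokens, mask) if not m]
    targets = [t for t, m in zip(tokens, mask) if m]
    ctx = sum(visible) / len(visible)
    errors = [weight * ctx - t for t in targets]
    loss = sum(e * e for e in errors) / len(errors)
    grad = sum(2.0 * e * ctx for e in errors) / len(errors)
    return loss, grad


def random_mask(n, ratio, rng):
    """Mask about ratio * n positions, keeping at least one masked and one visible."""
    k = min(n - 1, max(1, round(ratio * n)))
    masked = set(rng.sample(range(n), k))
    return [i in masked for i in range(n)]


def inner_loop(theta, tokens, rng, ratios=(0.75, 0.5, 0.25), inner_lr=0.1):
    """Simulated denoising trajectory: adapt the fast weight phi for one sample.

    The mask ratio shrinks step by step (hypothetical schedule); theta is
    held fixed and only phi, the per-sample memory, is updated.
    """
    phi = 0.0
    for ratio in ratios:
        mask = random_mask(len(tokens), ratio, rng)
        _, grad = masked_loss_and_grad(theta + phi, tokens, mask)
        phi -= inner_lr * grad
    return phi


def train(data, steps=200, outer_lr=0.05, seed=0):
    """Outer loop: update the base weight theta conditioned on the memory phi."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        tokens = rng.choice(data)
        phi = inner_loop(theta, tokens, rng)  # fresh memory per sample
        mask = random_mask(len(tokens), rng.uniform(0.1, 0.9), rng)
        # First-order approximation (assumption): phi is treated as a constant.
        _, grad = masked_loss_and_grad(theta + phi, tokens, mask)
        theta -= outer_lr * grad
    return theta


if __name__ == "__main__":
    data = [[0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 1.0],
            [1.8, 2.0, 2.2, 2.0, 1.9, 2.1, 2.0, 2.0]]
    theta = train(data)
    # Inference-time adaptation: re-run the inner loop on a new sample.
    phi = inner_loop(theta, data[0], random.Random(1))
    print("base weight:", theta, "fast weight:", phi)
```

In this sketch, calling `inner_loop` on a sample after training mirrors the idea of re-enabling the inner loop at inference time: predictions are made with `theta + phi`, and `phi` is discarded afterwards. A faithful implementation would operate on the parameters of a neural network and might differentiate through the inner updates; this toy deliberately avoids both.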