Diffusion language models (DLMs) have shown strong potential on general natural language tasks when provided with in-context examples. However, because of their bidirectional attention mechanism, DLMs incur substantial computational cost as the context length grows. This work addresses the issue through a key observation: unlike the strictly sequential generation of autoregressive language models (ARLMs), the diffusion generation paradigm of DLMs allows \textit{efficient dynamic adjustment of the context} during generation. Building on this insight, we propose the \textbf{D}ynamic \textbf{I}n-Context \textbf{P}lanner (DIP), a context-optimization method that dynamically selects and inserts in-context examples during generation rather than providing all examples upfront in the prompt. Results show that DIP maintains generation quality while achieving up to a 12.9$\times$ inference speedup over standard inference and 1.17$\times$ over KV cache-enhanced inference.