DLLM Agent: See Farther, Run Faster

Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li, Yuchuan Tian, Hanting Chen, Xiaoguang Li, Xiaosong Li, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Youliang Yan, Peifeng Qin, Jun Wang, Yu Wang, Dacheng Tao, Yunhe Wang

Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.
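To make the first deployment consideration concrete, the following is a minimal sketch of what checking a tool call for a "valid schema and arguments" could look like. It is an illustration only, not the paper's method: the tool names (`search`, `open_url`), their fields, the JSON call format, and the helper `validate_tool_call` are all hypothetical, and DeepDiver's actual tool interface may differ.

```python
import json
from typing import List

# Hypothetical tool registry: tool name -> {argument name: expected type}.
# These tools and fields are assumptions for illustration only.
TOOL_SCHEMAS = {
    "search": {"query": str},
    "open_url": {"url": str, "max_chars": int},
}


def validate_tool_call(raw: str) -> List[str]:
    """Return the problems found in a JSON tool call; an empty list means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncated or malformed output fails here, before any schema check.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["'arguments' must be a JSON object"]

    schema = TOOL_SCHEMAS[name]
    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:
            # Exact type match, so a JSON boolean is not accepted as an int.
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    return problems
```

A check of this kind could be used to measure how often a policy emits structurally invalid calls, or to filter trajectories before tool-call-specific training; whether the authors do either is not stated in the abstract.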
