The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, do such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones and exhibit systematic failures. (1) In Embodied settings, dLLMs get stuck in repeated attempts, failing to branch their plans under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g., strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection), but making them viable for agentic tasks requires incorporating causal, precise, and logically grounded reasoning mechanisms into the denoising process.