Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising an action proposal, a world belief, and a causal assumption, and is validated against the DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.