Given the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and multimodal outputs. We believe such models have greater potential for decision-making because they can reason through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, the dynamics model, and the value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as its own discriminator to filter out invalid dynamics predictions. Experiments on embodied decision-making tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods. It also shows strong data scalability: it requires no expert demonstrations and achieves better performance with the same amount of training data. This work lays a foundation for future research in reasoning and decision-making with UMMs.
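As a rough illustration of the idea that one model plays all three planning roles and also filters its own dynamics predictions, the sketch below uses a toy integer-line world. Everything in it is an assumption made for illustration: the `UnifiedModel` interface, its method names, the `ToyLineModel` stand-in, and the one-step lookahead in `plan_step` are hypothetical and are not taken from the Uni-Plan implementation, whose search procedure and validity check may differ.

```python
from typing import List, Optional, Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface: one model exposing every planning role."""

    def propose_actions(self, obs: int, goal: int) -> List[str]: ...      # policy
    def predict_next(self, obs: int, action: str) -> List[int]: ...       # dynamics
    def is_valid(self, obs: int, action: str, next_obs: int) -> bool: ... # self-discriminator
    def value(self, obs: int, goal: int) -> float: ...                    # value function


class ToyLineModel:
    """Toy stand-in on an integer line; NOT a real UMM."""

    STEP = {"left": -1, "right": 1}

    def propose_actions(self, obs: int, goal: int) -> List[str]:
        return ["left", "right"]

    def predict_next(self, obs: int, action: str) -> List[int]:
        step = self.STEP[action]
        # One correct prediction plus one deliberately hallucinated jump.
        return [obs + step, obs - 3 * step]

    def is_valid(self, obs: int, action: str, next_obs: int) -> bool:
        # The same model judges whether a predicted transition is plausible.
        return next_obs - obs == self.STEP[action]

    def value(self, obs: int, goal: int) -> float:
        return -abs(goal - obs)


def plan_step(model: UnifiedModel, obs: int, goal: int,
              use_filter: bool = True) -> Optional[str]:
    """One-step lookahead: propose, predict, filter, score, pick the best action."""
    best_action, best_score = None, float("-inf")
    for action in model.propose_actions(obs, goal):
        candidates = model.predict_next(obs, action)
        if use_filter:
            candidates = [n for n in candidates if model.is_valid(obs, action, n)]
        if not candidates:
            continue  # every prediction for this action was rejected
        score = max(model.value(n, goal) for n in candidates)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def rollout(model: ToyLineModel, obs: int, goal: int, max_steps: int = 10) -> int:
    """Repeatedly plan and apply the chosen action in the toy environment."""
    for _ in range(max_steps):
        action = None if obs == goal else plan_step(model, obs, goal)
        if action is None:
            break
        obs += ToyLineModel.STEP[action]
    return obs
```

In this toy setting, `plan_step(ToyLineModel(), 0, 3)` selects `"right"`, whereas passing `use_filter=False` selects `"left"`, because the hallucinated prediction for `"left"` appears to land exactly on the goal. This is the failure mode that filtering invalid dynamics predictions is meant to prevent.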