Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes the answers into a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks and show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
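The closed loop described above (describe, critique, probe, synthesize) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PRISM's actual implementation: the function name `dqa_describe`, the `max_questions` parameter, the callable signatures for the two models, and all prompt wording are hypothetical and chosen only to make the control flow concrete.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a VLM maps (image, prompt) to text, an LLM maps a prompt to text.
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]


def dqa_describe(image: str, goal: str, vlm: VLM, llm: LLM,
                 max_questions: int = 3) -> str:
    """Return a compact, goal-oriented description of `image`."""
    # 1. Perception: the VLM gives an initial, goal-agnostic description.
    initial = vlm(image, "Describe the scene.")

    # 2. Critique: the LLM reads the description against the goal and
    #    lists questions about missing task-critical details, one per line.
    raw_questions = llm(
        f"Goal: {goal}\nDescription: {initial}\n"
        f"List up to {max_questions} questions, one per line, about "
        "task-critical details the description is missing."
    )
    questions: List[str] = [
        q.strip() for q in raw_questions.splitlines() if q.strip()
    ][:max_questions]

    # 3. Probe: each question is sent back to the VLM with the same image.
    qa_pairs: List[Tuple[str, str]] = [(q, vlm(image, q)) for q in questions]

    # 4. Synthesis: the LLM merges description and answers into one summary.
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        f"Goal: {goal}\nDescription: {initial}\n{qa_text}\n"
        "Write a compact description containing only goal-relevant facts."
    )
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM API; the questions are generated by the LLM at run time rather than written by hand, which is the property the abstract refers to as fully automatic.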