Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After each action is executed, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only the relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI .
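The closed-loop behavior described in the abstract can be illustrated with a minimal sketch. It is not taken from the MALLVI repository: the function `run_closed_loop`, the `Verdict` type, the agent keys, and the stubbed `execute` and `reflect` callables are all hypothetical names chosen for illustration. The sketch assumes that a decomposition step has already produced a list of subtasks, and that the reflecting step returns either success or the names of the agents to rerun.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# All names below are hypothetical; none come from the MALLVI code base.


@dataclass
class Verdict:
    """Outcome of the reflection step after one action execution."""
    success: bool
    rerun: List[str] = field(default_factory=list)  # agents to reactivate


def run_closed_loop(
    subtasks: List[str],
    agents: Dict[str, Callable[[dict], Any]],
    execute: Callable[[dict], Any],
    reflect: Callable[[str, Any], Verdict],
    max_retries: int = 3,
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Run each subtask until the reflector accepts it or retries run out.

    Returns a log of (subtask, agents that ran) for every attempt.
    """
    log = []
    for subtask in subtasks:
        state = {"subtask": subtask}
        pending = list(agents)  # first attempt: run every agent
        for _ in range(max_retries):
            ran = tuple(name for name in agents if name in pending)
            for name in ran:  # dict order is the pipeline order
                state[name] = agents[name](state)
            log.append((subtask, ran))
            observation = execute(state)  # stand-in for acting, then re-imaging
            verdict = reflect(subtask, observation)
            if verdict.success:
                break
            pending = verdict.rerun  # targeted recovery, no full replanning
        else:
            raise RuntimeError(f"gave up on subtask: {subtask}")
    return log


def demo() -> List[Tuple[str, Tuple[str, ...]]]:
    """Stubbed run in which the first execution fails on localization."""
    counter = {"executions": 0}
    agents = {
        "localizer": lambda s: f"bbox for {s['subtask']}",
        "thinker": lambda s: f"action parameters for {s['subtask']}",
    }

    def execute(state: dict) -> int:
        counter["executions"] += 1
        return counter["executions"]

    def reflect(subtask: str, observation: int) -> Verdict:
        if observation == 1:  # pretend the very first attempt missed
            return Verdict(False, rerun=["localizer"])
        return Verdict(True)

    return run_closed_loop(["pick block", "place block"], agents, execute, reflect)


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

In this stubbed run the first attempt at "pick block" invokes both agents, the retry invokes only the localizer, and "place block" succeeds on its first attempt. A real system would replace the lambdas with model calls and would likely need to decide whether downstream agents must also rerun when an upstream output changes.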