Multimodal large language models (MLLMs) have shown remarkable capabilities in cross-modal understanding and reasoning, offering new opportunities for intelligent assistive systems. Existing systems, however, still struggle with risk-aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS, a Multi-Agent Robotic System powered by MLLMs for assistive intelligence, designed for smart home robots that support people with disabilities. The system integrates four agents: a visual perception agent that extracts semantic and spatial features from environment images, a risk assessment agent that identifies and prioritizes hazards, a planning agent that generates executable action sequences, and an evaluation agent that iteratively refines those plans. By combining multimodal perception with hierarchical multi-agent decision-making, the framework enables adaptive, risk-aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets show that the proposed system outperforms state-of-the-art multimodal models overall in risk-aware planning and coordinated multi-agent execution. The approach also highlights the potential of collaborative AI in practical assistive scenarios and provides a generalizable methodology for deploying MLLM-enabled multi-agent systems in real-world environments.
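The four-agent flow described in the abstract (perception, risk assessment, planning, evaluation with iterative refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`perceive`, `assess_risks`, `plan_actions`, `evaluate`, `run_pipeline`, the `SEVERITY` table, and the 0.6 cutoff) is a hypothetical stand-in, and each agent is a deterministic stub where a real system would presumably call an MLLM. The scene is given as a list of object labels rather than an image to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical severity table standing in for an MLLM-based risk assessor.
SEVERITY: Dict[str, float] = {"knife": 0.9, "spilled water": 0.7, "cable": 0.5}


@dataclass
class Hazard:
    label: str
    severity: float  # assumed scale: 0.0 (harmless) to 1.0 (critical)


def perceive(scene: List[str]) -> List[str]:
    """Perception stub: a real agent would extract objects from an image."""
    return [label.strip().lower() for label in scene]


def assess_risks(objects: List[str]) -> List[Hazard]:
    """Keep known hazards and order them from most to least severe."""
    hazards = [Hazard(o, SEVERITY[o]) for o in objects if o in SEVERITY]
    return sorted(hazards, key=lambda h: h.severity, reverse=True)


def plan_actions(hazards: List[Hazard], must_cover: List[str]) -> List[str]:
    """Planning stub: the first draft only handles severe hazards (>= 0.6);
    later drafts also handle anything the evaluator flagged."""
    return [
        f"mitigate {h.label}"
        for h in hazards
        if h.severity >= 0.6 or h.label in must_cover
    ]


def evaluate(plan: List[str], hazards: List[Hazard]) -> List[str]:
    """Evaluation stub: return labels of hazards the plan does not address."""
    return [h.label for h in hazards if f"mitigate {h.label}" not in plan]


def run_pipeline(scene: List[str], max_rounds: int = 3) -> Tuple[List[str], int]:
    """Run perception and risk assessment once, then loop plan -> evaluate
    until the evaluator has no complaints or max_rounds is reached."""
    hazards = assess_risks(perceive(scene))
    must_cover: List[str] = []
    plan: List[str] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        plan = plan_actions(hazards, must_cover)
        missed = evaluate(plan, hazards)
        if not missed:
            break
        must_cover = must_cover + missed
    return plan, rounds


if __name__ == "__main__":
    final_plan, used_rounds = run_pipeline(["Cable", "mug", "Knife", "spilled water"])
    print(final_plan, used_rounds)
```

In this toy run the first draft skips the low-severity cable, the evaluator flags it, and the second draft covers all three hazards in severity order. The point is only to show the feedback loop between the planning and evaluation agents; how MARS actually scores plans or passes feedback is not specified in the abstract.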