Recent advances in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level linguistic instructions into precise robotic actions in 3D space remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, alongside a benchmark for evaluating interactive, language-guided affordance segmentation in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we present a dataset of 10 typical indoor environments, each comprising 50 images annotated with object actions and 3D affordance segmentation masks. Extensive experiments demonstrate that IRIS handles interactive 3D affordance segmentation tasks across diverse settings, achieving competitive performance on various metrics. Our results highlight IRIS's potential for enhancing affordance-based human-robot interaction in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.