Recent advances in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level linguistic instructions into precise robotic actions in 3D space remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, alongside a benchmark for evaluating interactive, language-guided affordance segmentation in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we present a dataset of 10 typical indoor environments, each comprising 50 images annotated with object actions and 3D affordance segmentation masks. Extensive experiments demonstrate that IRIS handles interactive 3D affordance segmentation tasks across diverse settings, achieving competitive performance on various metrics. Our results highlight IRIS's potential for enhancing affordance-based human-robot interaction in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.