For robots to interact with objects effectively, they must understand the form and function of each object they encounter: which actions each object affords, and where on the object those affordances can be acted upon. Robots are ultimately expected to operate in unstructured human environments, where the set of objects and affordances is not known before deployment (i.e., the open-vocabulary setting). In this work, we introduce OVAL-Prompt, a prompt-based approach for open-vocabulary affordance localization in RGB-D images. By leveraging a Vision Language Model (VLM) for open-vocabulary object part segmentation and a Large Language Model (LLM) to ground an affordance to each part segment, OVAL-Prompt generalizes to novel object instances, categories, and affordances without domain-specific finetuning. Quantitative experiments demonstrate that, without any finetuning, OVAL-Prompt achieves localization accuracy competitive with supervised baseline models. Moreover, qualitative experiments show that OVAL-Prompt enables affordance-based robot manipulation of open-vocabulary object instances and categories. Project Page: https://ekjt.github.io/OVAL-Prompt/
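The abstract describes a two-stage pipeline: a VLM proposes open-vocabulary part segments for an object, and an LLM grounds the requested affordance to one of those segments. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation: `segment_parts` is a hypothetical stand-in for the VLM part segmenter, and `query_llm` for any chat-style LLM interface.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# ASSUMPTIONS: `segment_parts` and `query_llm` are hypothetical placeholders
# for a VLM part segmenter and an LLM API; they are not from the paper's code.
from dataclasses import dataclass
import numpy as np


@dataclass
class PartSegment:
    label: str        # e.g. "handle"
    mask: np.ndarray  # binary mask over the RGB-D image


def localize_affordance(rgbd_image, object_name: str, affordance: str,
                        segment_parts, query_llm) -> PartSegment:
    """Return the part segment that the LLM grounds to the requested affordance."""
    # Stage 1: the VLM proposes open-vocabulary part segments for the object.
    parts: list[PartSegment] = segment_parts(rgbd_image, object_name)

    # Stage 2: the LLM selects which proposed part affords the action.
    prompt = (
        f"An agent wants to '{affordance}' a {object_name}. "
        f"Candidate parts: {[p.label for p in parts]}. "
        "Answer with the single part name that affords this action."
    )
    chosen = query_llm(prompt).strip().lower()
    return next(p for p in parts if p.label.lower() == chosen)


if __name__ == "__main__":
    # Toy demo with stubbed models: a mug with two candidate parts.
    demo_parts = [PartSegment("handle", np.zeros((4, 4), dtype=bool)),
                  PartSegment("body", np.zeros((4, 4), dtype=bool))]
    seg = lambda img, obj: demo_parts
    llm = lambda prompt: "handle"  # a real LLM would reason over the prompt
    part = localize_affordance(None, "mug", "grasp", seg, llm)
    print(part.label)  # -> "handle"
```

Because both stages are prompt-driven, neither model needs finetuning on an affordance dataset, which is what makes the approach open-vocabulary.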