Affordance grounding aims to localize the interaction regions of manipulable objects in a scene image according to given instructions. A critical challenge in affordance grounding is that the embodied agent must understand human instructions and analyze which tools in the environment can be used, as well as how to use these tools to accomplish the instructions. Most recent works support only simple action labels as input instructions for localizing affordance regions, failing to capture complex human objectives. Moreover, these approaches typically identify the affordance regions of only a single object in object-centric images, ignoring the object context and struggling to localize the affordance regions of multiple objects in complex scenes, as required by practical applications. To address these limitations, we introduce, for the first time, a new task of affordance grounding based on natural language instructions, extending the input from simple action labels to complex human instructions. For this new task, we propose a new framework, WorldAfford. We design a novel Affordance Reasoning Chain-of-Thought Prompting scheme to elicit affordance knowledge from LLMs more precisely and logically. Subsequently, we use SAM and CLIP to localize the objects related to this affordance knowledge in the image, and identify their affordance regions through an affordance region localization module. To benchmark this new task and validate our framework, we construct an affordance grounding dataset, LLMaFF. Extensive experiments verify that WorldAfford achieves state-of-the-art performance on both the previous AGD20K benchmark and the new LLMaFF dataset. In particular, WorldAfford can localize the affordance regions of multiple objects and propose alternatives when objects in the environment cannot fully match the given instruction.
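To make the object localization step concrete, the sketch below shows one plausible way SAM and CLIP could be combined: SAM proposes class-agnostic masks, and CLIP scores each masked crop against affordance knowledge text produced by the LLM. This is a minimal illustration, not the authors' implementation; the checkpoint path, the `localize_objects` helper, and the `top_k` parameter are hypothetical, while the `segment_anything` and OpenAI `clip` APIs are used as published.

```python
# Hypothetical sketch of SAM + CLIP object localization for affordance grounding.
# SAM generates candidate masks; CLIP ranks masked crops against affordance text.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load models (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def localize_objects(image_rgb: np.ndarray, affordance_texts: list[str], top_k: int = 3):
    """Return the top_k SAM masks whose crops best match the affordance text."""
    # SAM returns a list of dicts with 'segmentation' (mask) and 'bbox' (XYWH).
    masks = mask_generator.generate(image_rgb)

    # Encode the LLM-derived affordance descriptions once.
    text_tokens = clip.tokenize(affordance_texts).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(text_tokens)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

    scored = []
    for m in masks:
        x, y, w, h = map(int, m["bbox"])
        crop = image_rgb[y:y + h, x:x + w]
        if crop.size == 0:
            continue
        img_in = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = clip_model.encode_image(img_in)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity; take the best match over all affordance texts.
        score = (img_feat @ text_feat.T).max().item()
        scored.append((score, m))

    scored.sort(key=lambda s: s[0], reverse=True)
    return [m for _, m in scored[:top_k]]
```

In this reading, thresholding or top-k selection over the CLIP scores determines which objects are passed on to the affordance region localization module; the exact scoring and selection strategy in WorldAfford may differ.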