In visual affordance learning, previous methods have mainly relied on abundant images or videos depicting human behavior patterns to identify action possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a main challenge of action ambiguity, for example whether a drum should be beaten or carried, as well as the difficulty of processing intricate scenes. Moreover, timely human intervention is important for rectifying robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied captions. This approach enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Because no appropriate dataset exists, we present a pioneering dataset and metrics tailored for this task, integrating images, heatmaps, and embodied captions. Furthermore, we propose a novel model that combines affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.