Zero-Shot Learning for the Primitives of 3D Affordance in General Objects

One of the major challenges in AI is teaching machines to respond to and utilize environmental functionalities precisely, thereby achieving the affordance awareness that humans possess. Despite its importance, learning-based research on affordance has lagged, especially in 3D, because annotating affordances is laborious given the numerous variations of human-object interaction. The scarcity of affordance data limits generalization across object categories and also forces simplified affordance representations that capture only a fraction of the affordance. To overcome these challenges, we propose a novel, self-supervised method that generates 3D affordance examples given only a 3D object, without any manual annotations. The method starts by capturing images of the 3D object and creating 2D affordance images by inserting humans into them via inpainting diffusion models, where we present the Adaptive Mask algorithm to enable human insertion without altering the original details of the object. The method then lifts the inserted humans back to 3D to create 3D human-object pairs, resolving depth ambiguity within a depth optimization framework that utilizes pre-generated human postures from multiple viewpoints. We also provide a novel affordance representation defined on the relative orientations and proximity between dense human and object points, which can be easily aggregated from any 3D HOI dataset. The proposed representation serves as a primitive that can be converted into conventional affordance representations via simple transformations, ranging from physically exerted affordances to nonphysical ones. We demonstrate the efficacy of our method and representation by generating 3D affordance samples and deriving high-quality affordance examples from the representation, including contact, orientation, and spatial occupancy.
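To make the idea of a primitive defined on relative orientation and proximity more concrete, the following is a minimal sketch, not the formulation used in this work. It assumes that each human and object point carries a unit normal, that proximity is the Euclidean distance between point pairs, and that relative orientation is the cosine between the two normals. The function names `pairwise_primitives` and `contact_from_primitives`, the Gaussian falloff, and the `sigma` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_primitives(human_pts, human_normals, object_pts, object_normals):
    """Hypothetical helper: pairwise proximity and relative orientation.

    human_pts, human_normals:   (H, 3) arrays, normals assumed unit length.
    object_pts, object_normals: (O, 3) arrays, normals assumed unit length.
    Returns (dist, cos), each of shape (H, O).
    """
    # Offset from every human point to every object point via broadcasting.
    diff = object_pts[None, :, :] - human_pts[:, None, :]  # (H, O, 3)
    dist = np.linalg.norm(diff, axis=-1)                   # (H, O)
    # Cosine of the angle between each human normal and each object normal.
    cos = human_normals @ object_normals.T                 # (H, O)
    return dist, cos


def contact_from_primitives(dist, cos, sigma=0.05):
    """Assumed 'simple transformation' from the primitive to a contact map.

    A pair scores high when the points are close (Gaussian falloff with
    scale sigma) and their normals oppose each other. The per-object-point
    score is the maximum over all human points.
    """
    facing = np.clip(-cos, 0.0, 1.0)                       # 1 when normals oppose
    score = np.exp(-dist ** 2 / (2.0 * sigma ** 2)) * facing
    return score.max(axis=0)                               # (O,)


# Toy usage: one human point hovering 1 cm above the first object point.
human_pts = np.array([[0.0, 0.0, 0.01]])
human_normals = np.array([[0.0, 0.0, -1.0]])
object_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
object_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

dist, cos = pairwise_primitives(human_pts, human_normals, object_pts, object_normals)
contact = contact_from_primitives(dist, cos)
```

In this toy setup the first object point receives a high contact score and the distant one a score near zero. Under the same assumptions, other conventional representations such as orientation or occupancy would be different reductions over the same `dist` and `cos` arrays, aggregated over many generated human-object pairs.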