CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task: the interactions that can be considered human-like and natural lie on specific manifolds, yet the human pose and the object geometry can vary even for similar interactions. Such diversity makes annotating 3D interactions difficult and hard to scale, which limits the potential to reason about them in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is to observe multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object. The core idea of our method is to use a generative model that produces high-quality 2D images from arbitrary text prompts as an "unbounded" data generator with effective controllability and view diversity. Although the synthesized images are imperfect compared with real images, we demonstrate that they are sufficient for learning 3D human-object spatial relations. Our contributions include (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interaction with the same object type; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus
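
To make the idea of 3D occupancy reasoning from multi-view 2D cues concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes that each image has already been reduced to a 2D object mask and a camera rotation expressed in a canonical, human-centered frame, and it uses a simple orthographic camera with per-voxel vote averaging. All function names (`make_grid`, `rotation_y`, `project`, `aggregate_occupancy`) and the toy ball-shaped "object" are hypothetical choices made for illustration.

```python
import numpy as np

def make_grid(res=16, extent=1.0):
    """Voxel centers of the cube [-extent, extent]^3 in the canonical frame."""
    axis = np.linspace(-extent, extent, res)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)  # (res^3, 3)

def rotation_y(angle):
    """Rotation about the vertical (y) axis; stands in for a per-image viewpoint."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points, R, size=32, extent=1.0):
    """Assumed orthographic camera: rotate, drop depth, map to pixel indices."""
    cam = points @ R.T
    scale = extent * np.sqrt(3.0)  # any rotated grid point stays within this bound
    uv = (cam[:, :2] / scale * 0.5 + 0.5) * (size - 1)
    return np.rint(uv).astype(int)  # (N, 2) as (column, row), always in-bounds

def aggregate_occupancy(points, masks, rotations, size=32):
    """Fraction of views in which each voxel projects inside the object mask."""
    votes = np.zeros(len(points))
    for mask, R in zip(masks, rotations):
        uv = project(points, R, size)
        votes += mask[uv[:, 1], uv[:, 0]]
    return votes / len(masks)

# Toy data: a ball-shaped "object" placed to one side of the canonical origin.
points = make_grid()
gt = np.linalg.norm(points - np.array([0.5, 0.0, 0.0]), axis=1) < 0.3

# Eight viewpoints around the vertical axis; each mask is the projected object.
rotations = [rotation_y(a) for a in np.linspace(0, np.pi, 8, endpoint=False)]
masks = []
for R in rotations:
    uv = project(points[gt], R)
    mask = np.zeros((32, 32), dtype=bool)
    mask[uv[:, 1], uv[:, 0]] = True
    masks.append(mask)

occ = aggregate_occupancy(points, masks, rotations)
print("mean score inside object: ", occ[gt].mean())
print("mean score outside object:", occ[~gt].mean())
```

Averaging votes rather than intersecting masks is the design choice that matters in this sketch: a voxel supported by most views keeps a high score even if a few views disagree, which is one simple way to tolerate inconsistent 2D cues.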
