Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes the 3D motion of the body and the object as input and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step, we use a network, ANet, to denoise the arm motion. We then leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach that enforces temporal consistency of the motion in the latent space (latent temporal consistency, LTC) and generates consistent interaction motions. In the second stage, GRIP refines the generated hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.
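For readers who want a concrete picture of the pipeline summarized above, the following is a minimal, illustrative sketch in PyTorch. Only ANet is named in the abstract; the module names `CoarseHandNet` and `RefineNet`, the tensor dimensions, the GRU used as a stand-in for latent temporal consistency, and the `interaction_cues` placeholder are assumptions made purely for illustration, not the architectures or cue definitions used in the paper.

```python
# Hedged sketch of the inference flow described in the abstract:
# ANet denoises the arm motion, temporal interaction cues are extracted from the
# body/object relationship, stage 1 generates temporally consistent coarse hand
# motion, stage 2 refines it to reduce hand-object penetration.
import torch
import torch.nn as nn

# Illustrative dimensions (assumed, not from the paper).
T, D_BODY, D_OBJ, D_ARM, D_CUE, D_HAND, D_LAT = 30, 165, 9, 24, 64, 90, 32

class ANet(nn.Module):
    """Denoises arm motion conditioned on body and object motion (per the abstract)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D_BODY + D_OBJ, 256), nn.ReLU(),
                                 nn.Linear(256, D_ARM))
    def forward(self, body, obj):
        return self.mlp(torch.cat([body, obj], dim=-1))

class CoarseHandNet(nn.Module):  # hypothetical name for stage 1
    """Stage 1: encode per-frame features to latent codes, enforce temporal
    consistency in the latent space (here simplified to a GRU), decode coarse hands."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(D_ARM + D_CUE, D_LAT)
        self.ltc = nn.GRU(D_LAT, D_LAT, batch_first=True)  # stand-in for LTC
        self.dec = nn.Linear(D_LAT, D_HAND)
    def forward(self, arm, cues):
        z = self.enc(torch.cat([arm, cues], dim=-1))
        z, _ = self.ltc(z)
        return self.dec(z)

class RefineNet(nn.Module):  # hypothetical name for stage 2
    """Stage 2: refine coarse hand poses to avoid hand-object penetration."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D_HAND + D_CUE, 256), nn.ReLU(),
                                 nn.Linear(256, D_HAND))
    def forward(self, coarse_hands, cues):
        return coarse_hands + self.mlp(torch.cat([coarse_hands, cues], dim=-1))

def interaction_cues(arm, obj):
    """Placeholder for the temporal interaction cues; just a fixed random projection
    of arm/object features so the shapes line up in this sketch."""
    return torch.tanh(torch.cat([arm, obj], dim=-1) @ torch.randn(D_ARM + D_OBJ, D_CUE))

if __name__ == "__main__":
    body = torch.randn(1, T, D_BODY)   # noisy body motion
    obj = torch.randn(1, T, D_OBJ)     # object motion
    arm = ANet()(body, obj)            # denoised arm motion
    cues = interaction_cues(arm, obj)  # temporal interaction cues
    coarse = CoarseHandNet()(arm, cues)
    hands = RefineNet()(coarse, cues)
    print(hands.shape)                 # torch.Size([1, 30, 90]): pose parameters for both hands
```

The sketch only mirrors the data flow stated in the abstract (denoise, extract cues, two-stage generation); the actual interaction cues, LTC formulation, and network designs are detailed in the paper.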