Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, known as proxemics, conveys social cues and affects the dynamics of social interaction. We present a novel approach that learns a 3D proxemics prior of two people in close social interaction. Since collecting a large 3D dataset of interacting people is challenging, we rely on 2D image collections, where social interactions are abundant. To obtain 3D training data from these images, we reconstruct pseudo-ground-truth 3D meshes of interacting people with an optimization approach that uses existing ground-truth contact maps. We then model proxemics with a novel denoising diffusion model called BUDDI, which learns the joint distribution of two people in close social interaction directly in the SMPL-X parameter space. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a user study. Additionally, we introduce a new optimization method that uses the diffusion prior to reconstruct two people in close proximity from a single image without any contact annotation. Our approach recovers more accurate and plausible 3D social interactions from noisy initial estimates and outperforms state-of-the-art methods. See our project site for code, data, and model: muelea.github.io/buddi.