Automatic perception of human behaviors during social interactions is crucial for AR/VR applications, and an essential component is the estimation of plausible 3D human pose and shape of our social partners from the egocentric view. One of the biggest challenges of this task is severe body truncation caused by close social distances in egocentric scenarios, which introduces large pose ambiguities for unseen body parts. To tackle this challenge, we propose a novel scene-conditioned diffusion method to model the body pose distribution. Conditioned on the 3D scene geometry, the diffusion model generates bodies in plausible human-scene interactions, with the sampling guided by a physics-based collision score to further resolve human-scene inter-penetrations. Classifier-free training enables flexible sampling with different conditions and enhanced diversity. A visibility-aware graph convolution model guided by per-joint visibility serves as the diffusion denoiser to incorporate inter-joint dependencies and per-body-part control. Extensive evaluations show that our method generates bodies in plausible interactions with 3D scenes, achieving both superior accuracy for visible joints and diversity for invisible body parts. The code will be available at https://sanweiliti.github.io/egohmr/egohmr.html.
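The two sampling ingredients named in the abstract, classifier-free guidance and guidance by a collision score, can be illustrated with a small sketch. This is not the authors' implementation. It assumes one common formulation of classifier-free guidance (blending conditional and unconditional noise predictions) and, purely for illustration, replaces the physics-based collision score with a toy penalty on joints that fall below a floor plane. All function names, the floor plane, and the step size are hypothetical.

```python
import numpy as np


def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    One common classifier-free guidance formulation (an assumption here):
    scale 0 gives the unconditional prediction, scale 1 the conditional one.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def floor_collision_score(joints, floor_z=0.0):
    """Toy stand-in for a collision score: sum of squared depths below a floor."""
    penetration = np.clip(floor_z - joints[:, 2], 0.0, None)
    return float(np.sum(penetration ** 2))


def floor_collision_grad(joints, floor_z=0.0):
    """Analytic gradient of floor_collision_score with respect to the joints."""
    penetration = np.clip(floor_z - joints[:, 2], 0.0, None)
    grad = np.zeros_like(joints)
    grad[:, 2] = -2.0 * penetration  # only the z coordinate affects the score
    return grad


def collision_guided_update(joints, step_size=0.25, floor_z=0.0):
    """Take one gradient step that reduces the toy collision score."""
    return joints - step_size * floor_collision_grad(joints, floor_z)


# Two hypothetical joints (x, y, z): one above the floor, one penetrating it.
joints = np.array([[0.0, 0.0, 0.5], [0.1, 0.0, -0.2]])
updated = collision_guided_update(joints)
```

In this sketch the update touches only the penetrating joint and leaves the other unchanged. In a diffusion sampler, a step of this kind would typically be applied to the denoised estimate at each sampling iteration, with the gradient taken against the actual scene geometry rather than a flat floor.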