The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we introduce a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we extract individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train a text-to-HOI model and a text-to-HHI model using score-based diffusion. Finally, we present a unified generative framework that integrates the two individual models and synthesizes complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
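The abstract describes integrating two separately trained score-based models in one sampling process. A common way to compose two diffusion models is to sum their scores during the reverse-time sampling loop; the sketch below illustrates this idea with annealed Langevin dynamics. All function names (`score_hoi`, `score_hhi`, `sample_hhoi`) are illustrative assumptions, and the toy scores stand in for the paper's actual trained networks.

```python
import numpy as np

# Toy stand-ins for the two learned score networks. In the paper's setting
# these would be the text-conditioned text-to-HOI and text-to-HHI models;
# here each is the score of a simple standard Gaussian, i.e. -x.
def score_hoi(x, t):
    return -x

def score_hhi(x, t):
    return -x

def sample_hhoi(dim=6, steps=200, step_size=0.01, seed=0):
    """Compose two score models in one Langevin sampling loop (a sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)  # initialize from Gaussian noise
    for i in range(steps, 0, -1):
        t = i / steps
        # Summing the two scores drives the sample toward configurations
        # plausible under BOTH interaction models simultaneously.
        s = score_hoi(x, t) + score_hhi(x, t)
        noise = rng.standard_normal(dim)
        x = x + step_size * s + np.sqrt(2.0 * step_size) * noise
    return x

sample = sample_hhoi()
print(sample.shape)  # (6,)
```

In practice `x` would parameterize body poses and object placement rather than a generic vector, and the paper's "advanced sampling process" may differ from plain score addition; this only conveys the compositional principle.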