We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods rely primarily on a direct text-to-HOI mapping, which, owing to the significant cross-modality gap, suffers from three key limitations: (Q1) suboptimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 at the data-modeling level. (2) Enhanced Object Representation: We augment existing object representations with geometric keypoints, contact features, and dynamic properties, yielding a more expressive representation that tackles Q2 at the data-representation level. (3) Modality-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model that fuses multimodal features effectively, tackling Q1 and Q2 at the feature-fusion level. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, tackling Q3 at the interaction-refinement level. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity, fine-grained HOI motions.
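To make insight (2) concrete, the following is a minimal sketch of what an enhanced object representation combining geometric keypoints, contact features, and dynamic properties could look like. All field names, shapes, and the 6D rotation convention are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical sketch of the enhanced object representation in insight (2).
# Shapes and field names are assumptions for illustration only.
from dataclasses import dataclass
import torch

@dataclass
class ObjectRepr:
    keypoints: torch.Tensor  # (T, K, 3) geometric keypoints on the object surface
    contact: torch.Tensor    # (T, K) per-keypoint contact probability with the body
    velocity: torch.Tensor   # (T, 6) linear + angular velocity (dynamic properties)
    pose: torch.Tensor       # (T, 9) global translation (3) + 6D rotation (6)

    def as_features(self) -> torch.Tensor:
        """Flatten all components into one per-frame feature vector."""
        T = self.keypoints.shape[0]
        return torch.cat(
            [self.keypoints.reshape(T, -1),  # (T, K*3)
             self.contact.reshape(T, -1),    # (T, K)
             self.velocity,                  # (T, 6)
             self.pose],                     # (T, 9)
            dim=-1,
        )
```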
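For insight (3), a minimal sketch of a modality-aware MoE fusion layer is shown below: each token carries a modality id (text / image / pose-object), and a learned per-modality bias steers the router toward experts suited to that modality. The class name, sizes, top-k routing scheme, and bias mechanism are assumptions chosen for illustration, not the paper's design.

```python
# Hypothetical modality-aware MoE fusion layer in the spirit of insight (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, num_modalities=3, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)
        # Learned per-modality bias on the routing logits.
        self.modality_bias = nn.Embedding(num_modalities, num_experts)

    def forward(self, x, modality_ids):
        # x: (B, N, dim) token sequence; modality_ids: (B, N) in {0, 1, 2}
        logits = self.router(x) + self.modality_bias(modality_ids)
        weights, idx = logits.topk(self.top_k, dim=-1)   # both (B, N, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # (B, N) tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask, None] * expert(x[mask])
        return out

# Usage: fuse a mixed-modality token sequence.
moe = ModalityAwareMoE()
tokens = torch.randn(2, 10, 256)
mods = torch.randint(0, 3, (2, 10))
fused = moe(tokens, mods)  # (2, 10, 256)
```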
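For insight (4), one plausible form of dedicated interaction supervision during cascaded refinement is a contact-consistency penalty: the refined motion is penalized whenever labeled human-object contact pairs drift apart. The loss below and its inputs are hypothetical placeholders, not the paper's actual objective.

```python
# Hypothetical interaction-supervision loss for the refinement stage (insight 4).
import torch

def interaction_loss(human_joints: torch.Tensor,
                     obj_keypoints: torch.Tensor,
                     contact_labels: torch.Tensor) -> torch.Tensor:
    """human_joints: (T, J, 3); obj_keypoints: (T, K, 3);
    contact_labels: (T, J, K), 1 where joint j should touch keypoint k."""
    dist = torch.cdist(human_joints, obj_keypoints)  # (T, J, K) pairwise distances
    # Pull labeled contact pairs together; unlabeled pairs stay unconstrained.
    return (contact_labels * dist).sum() / contact_labels.sum().clamp(min=1)
```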