Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and, owing to the scarcity of high-quality interaction data, fail to follow text prompts faithfully. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interactions that precisely follow the input interaction descriptions. We first curate realistic, high-quality interaction data by leveraging multimodal large language models, and then build a full text-to-3D pipeline on this data, achieving substantial improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, and generalizes strongly to diverse categories and interaction types while maintaining high-quality 3D generation.