Open-Vocabulary Functional 3D Human-Scene Interaction Generation

Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge is to reason jointly about the semantics of functional elements in a 3D scene and the 3D human poses required for functionality-aware interaction. Existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, which results in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that generates functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
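To make the notion of a contact graph concrete, the following is a minimal illustrative sketch, not the representation used by FunHSI. It assumes a contact graph can be viewed as a set of edges linking body parts to scene elements, with a flag marking the contacts that actually accomplish the task. All class names, field names, and labels (`Contact`, `ContactGraph`, `right_hand`, `thermostat_dial`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Contact:
    """One edge of the graph: a body part touching a scene element."""
    body_part: str      # hypothetical label, e.g. "right_hand"
    scene_element: str  # hypothetical label, e.g. "thermostat_dial"
    functional: bool    # True if this contact is what accomplishes the task


@dataclass
class ContactGraph:
    """Hypothetical container for the contacts implied by one task prompt."""
    task_prompt: str
    contacts: List[Contact] = field(default_factory=list)

    def add(self, body_part: str, scene_element: str, functional: bool = False) -> None:
        """Record that a body part should touch a scene element."""
        self.contacts.append(Contact(body_part, scene_element, functional))

    def functional_elements(self) -> List[str]:
        """Scene elements involved in at least one task-critical contact."""
        return sorted({c.scene_element for c in self.contacts if c.functional})

    def elements_for(self, body_part: str) -> List[str]:
        """Scene elements that a given body part is expected to touch."""
        return [c.scene_element for c in self.contacts if c.body_part == body_part]


# Assumed example for the prompt quoted in the abstract.
graph = ContactGraph("increasing the room temperature")
graph.add("right_hand", "thermostat_dial", functional=True)
graph.add("left_foot", "floor")
graph.add("right_foot", "floor")
print(graph.functional_elements())
```

Separating functional contacts (the hand on the control) from supporting contacts (the feet on the floor) mirrors the distinction the abstract draws between general interactions and fine-grained functional ones.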
