Open-Vocabulary Functional 3D Human-Scene Interaction Generation

Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate the proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
