Hand-object interaction (HOI) is fundamental to how humans express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or to generic intent instructions, even when these are expressed through elaborate language. Such overly general conditioning imposes a strong inductive bias toward stable grasps and thus fails to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions such as pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild 3D HOI dataset of diverse interactions derived from internet videos. It contains 4.4k unique interactions spanning 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that enables fine-grained semantic control to generate versatile hand poses beyond grasping priors. Generation is conditioned on explicit contact modeling and subsequently refined with contact-consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate that our method generates controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is $\href{https://guangyid.github.io/hoi123touch}{here}$.