Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches fall into two disjoint families: HOI generation synthesises scenes from structured triplets and layouts but cannot integrate mixed conditions such as HOI and object-only entities, while HOI editing modifies interactions via text yet struggles to decouple pose from physical contact and to scale to multiple interactions. We introduce OneHOI, a unified diffusion-transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention that enforces interaction topology, and HOI RoPE that disentangles multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K dataset alongside HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results in both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
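To make the structured interaction representation concrete, the following is a minimal, hypothetical sketch of how <person, action, object> triplets could be flattened into role- and instance-aware tokens with a block-structured attention mask that restricts each token to its own interaction, illustrating the topology-enforcing idea behind Structured HOI Attention. All names and the token layout here are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class HOITriplet:
    """One <person, action, object> interaction instance (illustrative only)."""
    person: str
    action: str
    obj: str

def build_tokens(triplets):
    """Flatten triplets into role-tagged tokens: (instance_id, role, text)."""
    tokens = []
    for i, t in enumerate(triplets):
        tokens += [(i, "person", t.person),
                   (i, "action", t.action),
                   (i, "object", t.obj)]
    return tokens

def structured_mask(tokens):
    """mask[q][k] is True iff tokens q and k belong to the same HOI instance,
    so attention within a triplet is allowed and cross-instance mixing is not."""
    n = len(tokens)
    return [[tokens[q][0] == tokens[k][0] for k in range(n)] for q in range(n)]

# A two-interaction scene: distinct instances stay disentangled in the mask.
triplets = [HOITriplet("person", "ride", "bicycle"),
            HOITriplet("person", "hold", "umbrella")]
tokens = build_tokens(triplets)
mask = structured_mask(tokens)
```

In a real model these boolean entries would be turned into additive attention biases (0 for allowed pairs, a large negative value for blocked ones) before the softmax.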