Large-scale text-to-image (T2I) diffusion models have shown a strong ability to generate coherent images from textual descriptions, enabling a wide range of applications in content generation. While recent advances have introduced control over factors such as object localization, posture, and image contours, a crucial gap remains in our ability to control the interactions between objects in the generated content. Controlling interactions well in generated images could yield meaningful applications, such as creating realistic scenes with interacting characters. In this work, we study the problem of conditioning T2I diffusion models on Human-Object Interaction (HOI) information, consisting of a triplet label (person, action, object) and the corresponding bounding boxes. We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models so that they can be better conditioned on interactions. Specifically, we tokenize the HOI information and learn the relationships among the resulting tokens via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens, thereby conditioning the visual tokens better in existing T2I diffusion models. Our model adds control over interaction and location to existing T2I diffusion models and outperforms existing baselines by a large margin in HOI detection score, as well as in fidelity as measured by FID and KID. Project page: https://jiuntian.github.io/interactdiffusion.
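To make the conditioning idea concrete, here is a minimal sketch of one plausible reading of the abstract, not the authors' implementation. It shows an HOI triplet with bounding boxes flattened into role-tagged tokens, and a toy conditioning self-attention step in which visual tokens attend over themselves and the HOI tokens. The names `HOI`, `hoi_to_tokens`, and `conditioning_self_attention` are hypothetical, and three choices are assumptions made purely for illustration: the union box used as the action region, the identity query/key/value projections, and the tanh-gated residual.

```python
import math
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


@dataclass(frozen=True)
class HOI:
    """One interaction: a (person, action, object) triplet plus two boxes."""
    person: str
    action: str
    obj: str
    person_box: Box
    object_box: Box


def hoi_to_tokens(hoi: HOI, instance_id: int) -> list[dict]:
    """Flatten one HOI into three role-tagged records.

    Assumption: the action is localized by the union of the person and
    object boxes; the abstract does not say how the action is located.
    """
    p, o = hoi.person_box, hoi.object_box
    union = (min(p[0], o[0]), min(p[1], o[1]), max(p[2], o[2]), max(p[3], o[3]))
    return [
        {"instance": instance_id, "role": "person", "text": hoi.person, "box": p},
        {"instance": instance_id, "role": "action", "text": hoi.action, "box": union},
        {"instance": instance_id, "role": "object", "text": hoi.obj, "box": o},
    ]


def _softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditioning_self_attention(
    visual: list[list[float]], hoi: list[list[float]], gate: float
) -> list[list[float]]:
    """Self-attention over [visual; hoi], keeping only the visual rows.

    Simplification: queries, keys, and values are the raw token vectors
    (no learned projections). The attended result is added back to each
    visual token through a tanh gate, so gate = 0 leaves the pre-trained
    model's visual tokens unchanged.
    """
    tokens = visual + hoi
    dim = len(visual[0])
    out = []
    for q in visual:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = _softmax(scores)
        attended = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)]
        out.append([qj + math.tanh(gate) * aj for qj, aj in zip(q, attended)])
    return out
```

In this sketch a gate of zero reproduces the input visual tokens exactly. That is one way a pluggable module could be attached to a pre-trained model without disturbing it at the start of training.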