To address the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition specific attributes of a generated image. Although successful, previous works do not account for the localization of these attributes in three-dimensional space. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce \textit{depth disentanglement training}, which leverages the relative depth of objects as an estimator and uses synthetic image triplets to let the model identify the absolute positions of unseen objects. We also introduce \textit{soft guidance}, a method for imposing global semantics onto targeted regions without any additional localization cues. Our integrated framework, \textsc{Compose and Conquer (CnC)}, unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/