Incorporating a customized object into image generation is an attractive capability for text-to-image generation. However, existing optimization-based and encoder-based methods suffer from drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration allows the object's spatial position and viewpoint to be adjusted, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce carefully crafted designs that enable location control and flexible background control through textual descriptions or user-provided images, overcoming the limitations of existing 3D novel view synthesis methods. We further leverage a dataset construction pipeline that better handles real-world objects and complex backgrounds. Equipped with these designs, our method performs zero-shot object customization without test-time optimization, offering simultaneous control over viewpoint, location, and background. As a result, CustomNet achieves enhanced identity preservation and generates diverse, harmonious outputs.