Text-to-image object customization, which aims to generate images that preserve the identity (ID) of an object of interest while following text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To fill this gap, we introduce AnyMaker, a zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to handle the diverse customization tasks for general objects. Then, to supply the diffusion UNet with as much of the extracted ID information as possible without damaging text editability during generation, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into the text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module that disentangles ID-related information from non-ID elements in the extracted representations, enabling high-fidelity generation with respect to both the identity and the text description. To validate our approach and advance research on general object customization, we construct the first large-scale general-ID dataset, Multi-Category ID-Consistent (MC-IDC), comprising 315k text-image samples across 10k categories. Experiments show that AnyMaker achieves remarkable performance in general object customization and outperforms specialized methods on their corresponding tasks. Code and dataset will be released soon.
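To make the global-local dual-level ID injection concrete, below is a minimal PyTorch sketch of the idea as described above: a pooled global ID embedding is fused into the text conditioning, while patch-level ID tokens are injected into UNet features through an added cross-attention. All module names, dimensions, and the specific fusion choices (token concatenation for the global level, residual cross-attention for the local level) are hypothetical illustrations under these assumptions, not the paper's actual implementation.

```python
# Sketch of global-local dual-level ID injection (illustrative only;
# names, dimensions, and fusion choices are assumptions, not AnyMaker's code).
import torch
import torch.nn as nn


class DualLevelIDInjection(nn.Module):
    def __init__(self, id_dim=1024, text_dim=768, unet_dim=320, num_heads=8):
        super().__init__()
        # Global level: project the pooled ID embedding into the text
        # embedding space so it can join the prompt tokens.
        self.global_proj = nn.Linear(id_dim, text_dim)
        # Local level: a newly added cross-attention lets UNet features
        # attend directly to fine-grained ID tokens.
        self.local_proj = nn.Linear(id_dim, unet_dim)
        self.local_attn = nn.MultiheadAttention(unet_dim, num_heads, batch_first=True)

    def inject_global(self, text_emb, id_global):
        # text_emb: (B, L, text_dim); id_global: (B, id_dim)
        # Append the projected global ID as one extra conditioning token.
        tok = self.global_proj(id_global).unsqueeze(1)   # (B, 1, text_dim)
        return torch.cat([text_emb, tok], dim=1)         # (B, L+1, text_dim)

    def inject_local(self, unet_feat, id_local):
        # unet_feat: (B, N, unet_dim) flattened spatial features
        # id_local:  (B, M, id_dim) patch-level ID detail tokens
        kv = self.local_proj(id_local)                   # (B, M, unet_dim)
        out, _ = self.local_attn(unet_feat, kv, kv)      # query = UNet features
        return unet_feat + out                           # residual injection


# Usage sketch with made-up shapes.
inj = DualLevelIDInjection()
text_emb = torch.randn(2, 77, 768)     # CLIP-style prompt tokens
id_global = torch.randn(2, 1024)       # pooled ID embedding
id_local = torch.randn(2, 256, 1024)   # patch-level ID tokens
unet_feat = torch.randn(2, 4096, 320)  # flattened UNet feature map
text_cond = inj.inject_global(text_emb, id_global)
feat = inj.inject_local(unet_feat, id_local)
```

The split reflects the stated design goal: the global token steers semantics through the existing text pathway, so prompt editability is preserved, while the local cross-attention carries ID details that text tokens cannot express.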