In the realm of image generation, creating customized images from a visual prompt with additional textual instructions has emerged as a promising endeavor. However, existing methods, both tuning-based and tuning-free, struggle to interpret the subject-essential attributes from the visual prompt. This allows subject-irrelevant attributes to infiltrate the generation process, ultimately compromising personalization quality in both editability and ID preservation. In this paper, we present DisEnvisioner, a novel approach that effectively extracts and enriches the subject-essential features while filtering out subject-irrelevant information, enabling exceptional customization performance in a tuning-free manner and using only a single image. Specifically, the features of the subject and other irrelevant components are separated into distinct visual tokens, enabling much more accurate customization. To further improve ID consistency, we enrich the disentangled features, sculpting them into more granular representations. Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and overall image quality, highlighting the effectiveness and efficiency of DisEnvisioner. Project page: https://disenvisioner.github.io/.
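The token-level disentanglement described above can be illustrated with a minimal sketch. Note that the query-based cross-attention mechanism, all function and variable names, and the feature dimensions below are assumptions for illustration only, not DisEnvisioner's actual architecture: two learnable query tokens pool the image features into a subject token and an irrelevant token, and only the subject token would be passed on to the generation process.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disentangle(image_feats, subject_query, irrelevant_query):
    """Hypothetical sketch: each learnable query cross-attends over the
    image patch features and pools them into a single visual token."""
    queries = np.stack([subject_query, irrelevant_query])            # (2, d)
    scores = queries @ image_feats.T / np.sqrt(image_feats.shape[1]) # (2, N)
    attn = softmax(scores, axis=-1)                                  # rows sum to 1
    tokens = attn @ image_feats                                      # (2, d)
    return tokens[0], tokens[1]  # subject token, irrelevant token

# Stand-in inputs: random patch features in place of a real image encoder.
rng = np.random.default_rng(0)
d, n = 64, 16
feats = rng.normal(size=(n, d))   # placeholder for encoder patch features
subj_q = rng.normal(size=d)       # learnable subject query (assumed)
irr_q = rng.normal(size=d)        # learnable irrelevant query (assumed)
subject_tok, irrelevant_tok = disentangle(feats, subj_q, irr_q)
```

In such a design, only `subject_tok` would condition the diffusion model, so irrelevant appearance details pooled into `irrelevant_tok` are kept out of generation by construction.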