Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images of a given subject. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the image's irrelevant attributes (e.g., view, pose, and background) as intrinsic attributes of the subject. This misconstruction leads to simultaneous overfitting of irrelevant attributes and underfitting of intrinsic attributes, i.e., the former are over-represented while the latter are under-represented, causing a trade-off between subject similarity and text controllability. In this study, we argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinguished). Specifically, we propose CustomContrast, a novel framework comprising a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) encoder. The MCL paradigm extracts intrinsic features of subjects, from high-level semantics to low-level appearance, through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, the MFI encoder captures cross-modal representations. Extensive experiments demonstrate the effectiveness of CustomContrast in both subject similarity and text controllability.
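The intra-consistency and inter-distinctiveness objectives described above can be illustrated with a supervised-contrastive (InfoNCE-style) loss: embeddings of the same subject act as positives pulled together, while embeddings of different subjects act as negatives pushed apart. The following is a minimal NumPy sketch of this general idea, not the paper's exact MCL formulation; the `supcon_loss` helper, the `temperature` value, and the toy tensors are illustrative assumptions.

```python
import numpy as np

def supcon_loss(features, subject_ids, temperature=0.07):
    """Illustrative supervised-contrastive loss (hypothetical helper,
    not CustomContrast's exact objective).

    features:    (N, D) array of subject embeddings.
    subject_ids: (N,) integer labels; equal ids mark the same subject.
    Returns the mean negative log-likelihood of same-subject positives.
    """
    # Cosine-similarity space: normalize each embedding to unit length.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature            # (N, N) similarity logits
    np.fill_diagonal(sim, -np.inf)         # exclude self-pairs

    # Positive mask: same subject, excluding the anchor itself.
    mask = subject_ids[:, None] == subject_ids[None, :]
    np.fill_diagonal(mask, False)

    # Numerically stable log-softmax over each anchor's row.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

    # Intra-consistency: maximize log-likelihood of positives;
    # inter-distinctiveness: softmax denominator penalizes negatives.
    per_anchor = -np.where(mask, log_prob, 0.0).sum(axis=1)
    per_anchor /= np.maximum(mask.sum(axis=1), 1)
    return per_anchor.mean()
```

As a sanity check, embeddings in which same-subject features coincide yield a much lower loss than embeddings in which same-subject features are orthogonal, matching the intuition that intrinsic attributes shared by a subject should dominate the representation.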