In text-to-image personalization, a timely and crucial challenge is the tendency of generated images to overfit to the biases present in the reference images. We initiate our study with a comprehensive categorization of these biases into background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. These biases manifest in the generated images because they become entangled with the subject embedding. This undesired embedding entanglement not only causes the biases of the reference images to be reflected in the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge, we propose SID~(Selectively Informative Description), a text description strategy that deviates from the prevalent approach of describing only the subject's class identification. SID is generated with multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results together with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment.