This paper analyzes the impact of the causal manner of the text encoder in text-to-image (T2I) diffusion models, which can lead to information bias and loss. Previous works have focused on addressing these issues through the denoising process, but no research has examined how the text embedding itself contributes to T2I models, especially when generating more than one object. In this paper, we present a comprehensive analysis of the text embedding: (i) how the text embedding contributes to the generated image, and (ii) why information is lost and biased towards the first-mentioned object. Accordingly, we propose a simple but effective, training-free text embedding balance optimization method, which improves information balance in Stable Diffusion by 125.42%. Furthermore, we propose a new automatic evaluation metric that quantifies information loss more accurately than existing methods, achieving 81% concordance with human assessments. This metric effectively measures the presence and accuracy of objects, addressing the limitations of current distribution scores such as CLIP's text-image similarity.
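As an illustration only, and not the paper's actual metric, the following minimal sketch shows how per-object presence and information balance could in principle be scored from per-object detector confidences on a generated image; the function `presence_and_balance`, its inputs, and the 0.5 threshold are hypothetical assumptions.

```python
import numpy as np

def presence_and_balance(confidences, threshold=0.5):
    """Illustrative per-object scores (hypothetical, not the paper's metric).

    confidences: one detection confidence per prompted object, e.g. from an
    open-vocabulary detector run on the generated image.
    Returns (presence, balance):
      presence - fraction of prompted objects detected above the threshold,
      balance  - ratio of the weakest to the strongest object confidence;
                 1.0 means every object is rendered equally well.
    """
    c = np.asarray(confidences, dtype=float)
    presence = float((c >= threshold).mean())
    balance = float(c.min() / (c.max() + 1e-8))
    return presence, balance

# Example: a two-object prompt where the second object is nearly dropped.
print(presence_and_balance([0.92, 0.18]))  # -> (0.5, ~0.20)
```

Unlike a single global text-image similarity, such per-object scores distinguish "both objects present but one weak" from "one object missing entirely", which is the kind of information loss the abstract refers to.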