When image generation is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do the two work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding, because information flows sequentially from gated self-attention to cross-attention. We demonstrate that this bias can be significantly mitigated, without sacrificing accuracy in either grounding, by simply rewiring the network architecture so that gated self-attention and cross-attention run in parallel rather than in sequence. This surprisingly simple yet effective solution requires no fine-tuning of the network. In our experiments, the rewired version shows significant improvements over the original GLIGEN in the trade-off between textual grounding and spatial grounding.
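The rewiring described above can be illustrated with a minimal sketch. It assumes each attention layer acts as a residual branch that maps hidden features to an update, and it uses toy stand-in functions in place of real attention; the names `sequential_block`, `parallel_block`, `spatial_update`, and `text_update` are hypothetical and are not taken from GLIGEN's code.

```python
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]  # maps hidden features to a residual update


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(h: Vector, gated_self_attn: Branch, cross_attn: Branch) -> Vector:
    """Sequential wiring: cross-attention reads features that the
    gated self-attention branch has already modified."""
    h = add(h, gated_self_attn(h))
    h = add(h, cross_attn(h))
    return h


def parallel_block(h: Vector, gated_self_attn: Branch, cross_attn: Branch) -> Vector:
    """Parallel wiring: both branches read the same input features,
    and their updates are summed onto the residual stream."""
    return add(add(h, gated_self_attn(h)), cross_attn(h))


# Toy stand-ins (assumptions for illustration, not real attention layers).
def spatial_update(h: Vector) -> Vector:
    # Pretend the bounding-box branch adds a constant offset.
    return [1.0 for _ in h]


def text_update(h: Vector) -> Vector:
    # Pretend the text branch's update depends on the features it reads.
    return [0.5 * x for x in h]


h0 = [2.0, 4.0]
print(sequential_block(h0, spatial_update, text_update))  # [4.5, 7.5]
print(parallel_block(h0, spatial_update, text_update))    # [4.0, 7.0]
```

In the sequential variant the text branch sees features already shifted by the spatial branch, whereas in the parallel variant it sees the unmodified input. Both variants reuse the same two branches unchanged, which is in keeping with the claim that no fine-tuning is needed.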