When image generation is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do the two work in harmony, or does one dominate the other? Our analysis of GLIGEN, a pretrained image diffusion model that integrates gated self-attention into the U-Net, reveals that spatial grounding often outweighs textual grounding because features flow sequentially from gated self-attention into cross-attention. We show that this bias can be substantially mitigated, without sacrificing accuracy in either grounding, by simply rewiring the network so that gated self-attention and cross-attention run in parallel rather than in sequence. This surprisingly simple yet effective change requires no fine-tuning of the network. Our experiments show that the rewired version significantly reduces the trade-off between textual grounding and spatial grounding relative to the original GLIGEN.
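The difference between the sequential and the parallel wiring can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' implementation: it assumes both attention layers are residual branches, and the names `sequential_block`, `parallel_block`, `toy_spatial`, and `toy_text` are hypothetical stand-ins for the real gated self-attention and cross-attention layers.

```python
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(h: Vector, gated_self_attn: Branch, cross_attn: Branch) -> Vector:
    """Sequential wiring: cross-attention reads features that the
    spatial (bounding-box) branch has already modified."""
    h = add(h, gated_self_attn(h))  # spatial residual is applied first
    h = add(h, cross_attn(h))       # text residual depends on the spatial update
    return h


def parallel_block(h: Vector, gated_self_attn: Branch, cross_attn: Branch) -> Vector:
    """Parallel wiring (assumed form): both branches read the same input
    and their residuals are summed, so neither branch sees the other's output."""
    spatial = gated_self_attn(h)
    text = cross_attn(h)
    return add(add(h, spatial), text)


# Hypothetical toy branches; real layers would be attention modules.
def toy_spatial(h: Vector) -> Vector:
    return [0.5 * v for v in h]


def toy_text(h: Vector) -> Vector:
    return [v * v for v in h]


if __name__ == "__main__":
    x = [2.0]
    print(sequential_block(x, toy_spatial, toy_text))  # text branch sees 3.0
    print(parallel_block(x, toy_spatial, toy_text))    # text branch sees 2.0
```

Because the rewiring only changes which tensor each branch reads, the pretrained weights of both branches can be reused as they are, which is consistent with the claim that no fine-tuning is needed.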