Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms CoMPaSS, the current baseline framework, on spatial consistency benchmarks.