Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods, which rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS), which enhances diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embedding space along two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement, with minimal impact on image fidelity and semantic alignment.
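The orthogonal decomposition described above can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: it uses random vectors in place of real CLIP embeddings, and it picks the leading principal direction of the residuals as the prompt-independent axis, which is one plausible choice of "identified orthogonal direction" and an assumption on our part. Variance is used as a stand-in for the projection-spread measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings; in practice these would come from a
# CLIP text/image encoder (dimension 512 is illustrative).
d = 512
text_emb = rng.normal(size=d)            # text embedding of the prompt
image_embs = rng.normal(size=(8, d))     # 8 generated samples for one prompt

# Unit direction of the prompt's text embedding (prompt-dependent axis).
u_text = text_emb / np.linalg.norm(text_emb)

# Project each image embedding onto the text direction, and keep the
# residual component, which is orthogonal to the prompt direction.
proj_text = image_embs @ u_text                      # one scalar per sample
residual = image_embs - np.outer(proj_text, u_text)  # orthogonal components

# One choice of prompt-independent axis: the leading principal direction
# of the centered residuals (an assumption; the paper identifies its own).
residual_centered = residual - residual.mean(axis=0)
_, _, vt = np.linalg.svd(residual_centered, full_matrices=False)
u_orth = vt[0]
proj_orth = residual @ u_orth

# "Projection spread" along each axis, here measured as variance.
spread_text = proj_text.var()
spread_orth = proj_orth.var()

# The two axes are orthogonal by construction.
print(abs(u_text @ u_orth))  # numerically ~0
```

A guidance term that increases `spread_text` and `spread_orth` across a batch would then push samples apart along the semantic and non-semantic axes separately, which is the disentanglement the abstract refers to.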