Text-to-image (T2I) systems increasingly rely on Large Language Model (LLM)-based text conditioning to interpret and expand user prompts. While this improves prompt understanding and text-image alignment, we find that it can also introduce implicit demographic assumptions, even when demographic attributes are unspecified. To systematically investigate this behavior across varying levels of prompt ambiguity and complexity, we construct a comprehensive benchmark covering diverse prompt settings. Evaluations on eight recent T2I models show that LLM-based systems consistently exhibit stronger demographic skew than non-LLM-based baselines. We further analyze system prompts, a component unique to LLM-based T2I systems that guides prompt interpretation and expansion. Our analyses show that these instructions strongly influence text embeddings, which subsequently leads to biased image generations. Motivated by these findings, we propose FairPro, a training-free debiasing framework that adaptively generates fairness-aware instructions while preserving user intent. Experiments demonstrate that FairPro substantially reduces demographic disparities while maintaining prompt fidelity.