In this paper, we address a limitation of existing text-to-image diffusion models: they fail to generate demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, which leads to biased image generation. To overcome this challenge, we propose Fair Mapping, a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model by controlling the prompt to achieve fair image generation. A key advantage of our approach is its efficiency: it only requires updating an additional linear network with few parameters, at low computational cost. This linear network maps conditioning embeddings into a debiased space, enabling the generation of relatively balanced demographic results for the specified text condition. Through comprehensive experiments on face image generation, we show that, when prompted with human-related descriptions, our method significantly improves image generation fairness while maintaining almost the same image quality as conventional diffusion models. By effectively addressing implicit language bias, our method produces fairer and more diverse image outputs.
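The idea of a linear network that maps conditioning embeddings into a debiased space can be illustrated with a minimal sketch. The abstract does not specify the network's training objective or dimensions, so everything below is an assumption made for illustration: the function names (`linear_map`, `fairness_gap`, `identity_init`) are hypothetical, the 2-D vectors stand in for real text-encoder outputs, the "gap" is just one plausible way to quantify imbalance, and the bias is set by hand rather than learned.

```python
import math


def identity_init(dim):
    """Start the linear map as the identity, so embeddings pass through unchanged."""
    weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    bias = [0.0] * dim
    return weight, bias


def linear_map(weight, bias, embedding):
    """Apply y = W x + b to a single conditioning embedding."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fairness_gap(mapped, group_embeddings):
    """Assumed imbalance measure: spread of distances from the mapped prompt
    embedding to each demographic-specific prompt embedding (0 = equidistant)."""
    dists = [distance(mapped, g) for g in group_embeddings]
    return max(dists) - min(dists)


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
prompt = [1.0, 0.0]                    # e.g. a neutral occupation prompt
groups = [[1.0, 0.2], [1.0, -1.0]]     # e.g. the same prompt with group attributes

weight, bias = identity_init(2)
print("gap before mapping:", fairness_gap(linear_map(weight, bias, prompt), groups))

# A hand-picked bias that moves the prompt to the midpoint between the groups;
# in practice these parameters would be learned while the diffusion model stays frozen.
debias_bias = [0.0, -0.4]
print("gap after mapping: ", fairness_gap(linear_map(weight, debias_bias, prompt), groups))
```

In this toy setting the identity map leaves the prompt embedding closer to one group than the other, while the shifted map makes it equidistant. Only `weight` and `bias` would change during training, which is consistent with the low parameter count the abstract emphasizes.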