Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, to our knowledge the first framework for unsupervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.
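The abstract describes a minority metric that combines neuron activation frequency with semantic distinctiveness. The following is a minimal sketch of one plausible reading, not the paper's actual formulation: rarity is taken as one minus the firing frequency of each sparse-autoencoder neuron, distinctiveness as one minus the mean cosine similarity of each neuron's semantic embedding to all others, and the two are multiplied. The function name, the product form, and the use of per-neuron embeddings (e.g., CLIP embeddings of top-activating images) are all assumptions for illustration.

```python
import numpy as np

def minority_scores(acts, sem_vectors, eps=1e-8):
    """Hypothetical minority metric: rarity times semantic distinctiveness.

    acts: (n_samples, n_neurons) nonnegative SAE activations.
    sem_vectors: (n_neurons, d) one semantic embedding per neuron,
        e.g., an embedding of its top-activating images (assumption).
    Returns (n_neurons,): higher means rarer and more distinct.
    """
    # Activation frequency: fraction of samples on which each neuron fires.
    freq = (acts > 0).mean(axis=0)
    rarity = 1.0 - freq  # rarely-firing neurons score high

    # Semantic distinctiveness: 1 minus mean cosine similarity
    # to every other neuron's semantic vector.
    v = sem_vectors / (np.linalg.norm(sem_vectors, axis=1, keepdims=True) + eps)
    sim = v @ v.T
    n = sim.shape[0]
    mean_sim = (sim.sum(axis=1) - 1.0) / (n - 1)  # exclude self-similarity
    distinct = 1.0 - mean_sim

    return rarity * distinct
```

Under this sketch, a neuron that fires on every sample gets rarity zero and is ruled out regardless of its semantics, while a rare neuron whose embedding sits far from the rest of the dictionary scores highest.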