Condensing large datasets into smaller synthetic counterparts has shown promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased with respect to protected attributes (PAs), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness toward minority groups present in original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge this research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches without modifying their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of the original datasets, rather than the indiscriminate alignment to whole distributions in vanilla DDs, which are dominated by majority groups. This synchronized matching prevents synthetic datasets from collapsing onto majority groups and promotes balanced generation across all PA groups. Consequently, FairDD effectively regularizes vanilla DDs to favor generation toward minority groups while maintaining accuracy on the target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness over vanilla DD methods without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution Matching and Gradient Matching, establishes it as a versatile FDD approach.
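To make the core idea concrete, the contrast between whole-distribution alignment and PA-wise synchronized matching can be sketched as follows. This is a minimal numpy illustration assuming a mean-feature distribution-matching objective; the function names, the 1-D toy features, and the use of a single synthetic mean per class are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def vanilla_dm_loss(syn_feats, real_feats):
    # Vanilla distribution matching (sketch): align the synthetic mean
    # to the mean of the *whole* real distribution. With imbalanced PA
    # groups, this target is dominated by the majority group.
    return float(np.sum((syn_feats.mean(axis=0) - real_feats.mean(axis=0)) ** 2))

def pa_wise_matching_loss(syn_feats, real_feats, pa_labels):
    # PA-wise synchronized matching (sketch of the FairDD idea): align
    # the synthetic mean to *each* PA group's mean simultaneously, so
    # every group contributes equally to the target regardless of size.
    syn_mean = syn_feats.mean(axis=0)
    loss = 0.0
    for a in np.unique(pa_labels):
        group_mean = real_feats[pa_labels == a].mean(axis=0)
        loss += float(np.sum((syn_mean - group_mean) ** 2))
    return loss

# Toy example: 90 majority-group samples at feature value 0,
# 10 minority-group samples at feature value 1.
real = np.concatenate([np.zeros((90, 1)), np.ones((10, 1))])
pa = np.array([0] * 90 + [1] * 10)

syn_at_overall_mean = np.full((4, 1), 0.1)   # collapses toward the majority
syn_balanced = np.full((4, 1), 0.5)          # equidistant from both groups
```

On this toy data, the vanilla loss is minimized by the majority-dominated overall mean (0.1), whereas the PA-wise loss is lower at the balanced point (0.5), illustrating why synchronized matching discourages collapse onto the majority group.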