Condensing large datasets into smaller synthetic counterparts has shown promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased with respect to protected attributes (PAs), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness toward minority groups present in original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge this research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches without modifying their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of the original datasets, rather than the indiscriminate alignment to whole distributions in vanilla DDs, which are dominated by majority groups. This synchronized matching prevents synthetic datasets from collapsing onto majority groups and promotes balanced generation across all PA groups. Consequently, FairDD effectively regularizes vanilla DDs to favor generation toward minority groups while maintaining accuracy on the target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness over vanilla DD methods without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution Matching and Gradient Matching, establishes it as a versatile FDD approach.
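To make the core idea concrete, the contrast between whole-distribution alignment and PA-wise synchronized matching can be sketched as follows. This is a minimal numpy illustration assuming a mean-feature distribution-matching objective; the function names, the 1-D toy features, and the use of a single synthetic mean per class are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def vanilla_dm_loss(syn_feats, real_feats):
    # Vanilla distribution matching (sketch): align the synthetic mean
    # to the mean of the *whole* real distribution. With imbalanced PA
    # groups, this target is dominated by the majority group.
    return float(np.sum((syn_feats.mean(axis=0) - real_feats.mean(axis=0)) ** 2))

def pa_wise_matching_loss(syn_feats, real_feats, pa_labels):
    # PA-wise synchronized matching (sketch of the FairDD idea): align
    # the synthetic mean to *each* PA group's mean simultaneously, so
    # every group contributes equally to the target regardless of size.
    syn_mean = syn_feats.mean(axis=0)
    loss = 0.0
    for a in np.unique(pa_labels):
        group_mean = real_feats[pa_labels == a].mean(axis=0)
        loss += float(np.sum((syn_mean - group_mean) ** 2))
    return loss

# Toy example: 90 majority-group samples at feature value 0,
# 10 minority-group samples at feature value 1.
real = np.concatenate([np.zeros((90, 1)), np.ones((10, 1))])
pa = np.array([0] * 90 + [1] * 10)

syn_at_overall_mean = np.full((4, 1), 0.1)   # collapses toward the majority
syn_balanced = np.full((4, 1), 0.5)          # equidistant from both groups
```

On this toy data, the vanilla loss is minimized by the majority-dominated overall mean (0.1), whereas the PA-wise loss is lower at the balanced point (0.5), illustrating why synchronized matching discourages collapse onto the majority group.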