Diffusion models have emerged as a robust framework for generative tasks such as image and audio synthesis, and they have also proven capable of generating mixed-type tabular data comprising both continuous and discrete variables. However, current approaches to training diffusion models on mixed-type tabular data tend to inherit the imbalanced feature distributions of the training dataset, which can result in biased sampling. In this work, we introduce a fair diffusion model designed to generate data that is balanced with respect to sensitive attributes. We present empirical evidence that our method mitigates class imbalance in the training data while maintaining the quality of the generated samples. We further show that our approach outperforms existing methods for synthesizing tabular data in terms of both performance and fairness.