We live in an era of data floods, and deep neural networks play a pivotal role in it. Natural data inherently poses several challenges, such as long-tailed distributions and model fairness, and data imbalance lies at the center of these fundamental issues. This imbalance risks causing deep neural networks to produce biased predictions, which can lead to severe ethical and social problems. To address these problems, we leverage recent advances in generative models that can produce high-quality images. In this work, we propose SYNAuG, which uses synthetic data to uniformize the given imbalanced distribution, followed by a simple post-calibration step that accounts for the domain gap between real and synthetic data. This straightforward approach yields strong performance on datasets for distinct data imbalance problems, such as CIFAR100-LT, ImageNet100-LT, UTKFace, and Waterbirds, surpassing existing task-specific methods. While we do not claim that our approach is a complete solution to the problem of data imbalance, we argue that supplementing the existing data with synthetic data is an effective and crucial step in addressing data imbalance concerns.
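To make the "uniformize" step concrete, the following is a minimal sketch of how one might compute the number of synthetic samples to request per class. It is an illustration only, not the SYNAuG implementation: the function name `synthetic_counts_to_uniformize`, the toy labels, and the choice of the largest class size as the default target are all assumptions, and the post-calibration step is not shown because the abstract does not specify it.

```python
from collections import Counter


def synthetic_counts_to_uniformize(labels, target=None):
    """Return {class: number of synthetic samples needed to reach `target`}.

    Hypothetical helper for illustration. If `target` is None, the size of
    the largest class is used (an assumption, not stated in the abstract).
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic samples.
    return {cls: max(target - n, 0) for cls, n in counts.items()}


# Toy long-tailed label list: 5 cats, 2 dogs, 1 bird.
labels = ["cat"] * 5 + ["dog"] * 2 + ["bird"] * 1
needed = synthetic_counts_to_uniformize(labels)
print(needed)
```

In a full pipeline, the per-class counts would presumably be passed to a generative model to produce that many images per class before training on the combined real and synthetic set.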