We live in a vast ocean of data, and deep neural networks are no exception: they, too, depend on it. However, this data is inherently imbalanced. Networks trained on imbalanced data risk producing biased predictions, with potentially severe ethical and social consequences. To address these challenges, we believe that generative models offer a promising approach for comprehending tasks, given the remarkable advances that recent diffusion models have demonstrated in generating high-quality images. In this work, we propose a simple yet effective baseline, SYNAuG, that uses synthetic data as a preliminary step before applying task-specific algorithms to data imbalance problems. This straightforward approach yields impressive performance on datasets such as CIFAR100-LT, ImageNet100-LT, UTKFace, and Waterbird, surpassing existing task-specific methods. While we do not claim that our approach is a complete solution to the problem of data imbalance, we argue that supplementing existing data with synthetic data is an effective and crucial preliminary step in addressing it.
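As a rough illustration of the preliminary step described above, the following sketch computes how many synthetic samples each class would need before a task-specific algorithm is applied. The abstract does not state how much synthetic data is added per class, so the rule used here (fill every class up to the size of the largest one) is an assumption, and the function name `synthetic_quota` is hypothetical. Generating the images themselves with a diffusion model is outside the scope of this sketch.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return {class: number of synthetic samples to add}.

    Assumption (not stated in the abstract): each class is filled up to
    `target` samples, defaulting to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target receive no synthetic samples.
    return {cls: max(0, target - n) for cls, n in sorted(counts.items())}


# Toy long-tailed label distribution (illustrative values only).
labels = ["cat"] * 100 + ["dog"] * 20 + ["owl"] * 5
print(synthetic_quota(labels))  # {'cat': 0, 'dog': 80, 'owl': 95}
```

The resulting quotas would then drive a generative model, and the combined real and synthetic set would be passed to whatever task-specific method follows.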