Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are rooted in data imbalance: certain groups of data samples are significantly underrepresented, which in turn compromises the accuracy, robustness, and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples that augment the observed data. In the imbalanced-data setting, LLMs have been used to oversample underrepresented groups and have shown promising improvements. However, such synthetic data approaches still lack theoretical grounding. In this article, we develop novel theoretical foundations to systematically study the role of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics of synthetic data augmentation and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of LLM-based synthetic oversampling and augmentation.
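To make the oversampling pipeline concrete, the sketch below balances a toy imbalanced dataset with synthetic minority samples and compares a classifier trained with and without them. This is a minimal illustration, not the paper's method: a Gaussian fit to the minority group stands in for the LLM generator, and the group distributions, sample sizes, and variable names are all illustrative assumptions.

```python
# Minimal sketch of synthetic oversampling for imbalanced binary
# classification. The generative model is replaced by a Gaussian fit
# to the observed minority group; in the paper's setting, synthetic
# samples would instead come from a transformer-based generator.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Imbalanced training data: 1000 majority vs. 50 minority samples
# (assumed Gaussian group distributions, for illustration only).
X_maj = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
X_min = rng.normal(loc=1.5, scale=1.0, size=(50, 5))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(1000), np.ones(50)])

# Stand-in "generator": sample synthetic minority points from a
# Gaussian fit to the observed minority group, up to class balance.
mu = X_min.mean(axis=0)
cov = np.cov(X_min, rowvar=False)
X_syn = rng.multivariate_normal(mu, cov, size=950)

X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, np.ones(950)])

# Balanced held-out test set from the same group distributions.
X_test = np.vstack([rng.normal(0.0, 1.0, size=(500, 5)),
                    rng.normal(1.5, 1.0, size=(500, 5))])
y_test = np.concatenate([np.zeros(500), np.ones(500)])

# Compare classifiers trained on observed vs. augmented data.
for name, (Xtr, ytr) in {"observed only": (X, y),
                          "synthetic oversampling": (X_aug, y_aug)}.items():
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    acc = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: balanced accuracy = {acc:.3f}")
```

Under these assumptions, the augmented classifier typically attains higher balanced accuracy because the decision boundary is no longer dominated by the majority group; the theory developed in this article quantifies when and by how much such synthetic oversampling helps.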