Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in underrepresented classes, is widely used to address them. In this article, we introduce OPAL (\textbf{O}versam\textbf{P}ling with \textbf{A}rtificial \textbf{L}LM-generated data), a systematic oversampling approach that uses large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation with deep generative models mostly target prediction tasks; our proposal instead focuses on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data and shows the capacity of transformers to generate high-quality synthetic data for both labels and covariates. We further conduct extensive numerical experiments to demonstrate the efficacy of our approach compared with representative alternative solutions.