Machine Learning (ML) in low-data settings remains an underappreciated yet crucial problem. Data augmentation methods that increase the sample size of datasets are therefore key to unlocking the transformative potential of ML in data-deprived regions and domains. Unfortunately, a limited training set constrains the ability of traditional tabular synthetic data generators to produce the large and diverse augmented dataset that ML tasks need. To address this challenge, we introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. However, as with any generative model, not all data generated by LLMs improves downstream utility. Consequently, we introduce a principled curation mechanism that leverages learning dynamics, coupled with confidence and uncertainty metrics, to obtain a high-quality dataset. Empirically, on multiple real-world datasets, we demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators. Additionally, we provide insights into the LLM generation and curation mechanisms, shedding light on the features that enable them to output high-quality augmented datasets.
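To make the curation idea concrete, the following is a minimal sketch of how confidence and uncertainty could be derived from learning dynamics and used to filter generated samples. It is an illustration under stated assumptions, not the method as specified in this abstract: the function names `curation_scores` and `curate`, the threshold values, the use of the mean of p(1 - p) as the uncertainty measure, and the rule of discarding samples that are both low-confidence and low-uncertainty are all hypothetical choices. The input is assumed to be the probability a classifier assigns to each synthetic sample's own label at several training checkpoints.

```python
import numpy as np


def curation_scores(true_label_probs):
    """Summarize learning dynamics per sample.

    true_label_probs: array of shape (n_checkpoints, n_samples) holding the
    probability assigned to each sample's own label at each checkpoint
    (an assumed input format, not taken from the source).
    """
    p = np.asarray(true_label_probs, dtype=float)
    if p.ndim != 2:
        raise ValueError("expected shape (n_checkpoints, n_samples)")
    # Confidence: average probability of the sample's label over checkpoints.
    confidence = p.mean(axis=0)
    # Uncertainty: average of p * (1 - p) over checkpoints (assumed measure).
    uncertainty = (p * (1.0 - p)).mean(axis=0)
    return confidence, uncertainty


def curate(true_label_probs, conf_threshold=0.5, unc_threshold=0.1):
    """Return a boolean mask of samples to keep.

    Assumed rule: drop samples that are both low-confidence and
    low-uncertainty, i.e. the model consistently disagrees with their label.
    The thresholds are illustrative only.
    """
    confidence, uncertainty = curation_scores(true_label_probs)
    discard = (confidence < conf_threshold) & (uncertainty < unc_threshold)
    return ~discard


# Toy example: 3 checkpoints (rows) x 3 synthetic samples (columns).
probs = np.array([
    [0.90, 0.05, 0.40],
    [0.95, 0.04, 0.55],
    [0.97, 0.02, 0.50],
])
keep_mask = curate(probs)
print(keep_mask)  # second sample is consistently contradicted, so it is dropped
```

In this toy input the first sample is confidently consistent with its label, the third is ambiguous (low confidence but high uncertainty), and only the second is both low-confidence and low-uncertainty, so only it is filtered out under the assumed rule.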