While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit critical deficiencies such as class imbalance, selection bias, and low fidelity. To address these challenges, this paper builds on recent advances in Large Language Models (LLMs) and introduces Team-then-Trim (T$^2$), a framework that synthesizes high-quality tabular data through a collaborative team of LLMs, followed by a rigorous three-stage plug-in data quality control (QC) pipeline. In T$^2$, tabular data generation is conceptualized as a manufacturing process: specialized LLMs, guided by domain knowledge, sequentially generate the different components of the data, and the resulting products, i.e., the synthetic data, are systematically evaluated across multiple QC dimensions. Empirical results on both simulated and real-world datasets demonstrate that T$^2$ outperforms state-of-the-art methods in producing high-quality tabular data, highlighting its potential to support downstream models when direct data collection is practically infeasible.
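The team-then-trim idea described above can be sketched at a very high level: a sequence of specialized generators each produces one component of a synthetic row (the "team" stage), and a quality-control filter discards rows that fail basic checks (the "trim" stage). The sketch below is purely illustrative and assumes nothing about the paper's actual prompts or QC criteria; plain Python functions (`gen_age`, `gen_income`, `gen_label`, all hypothetical) stand in for the domain-guided LLM agents.

```python
import random

def gen_age(row):
    # first specialist generates the "age" component
    row["age"] = random.randint(18, 90)

def gen_income(row):
    # downstream specialist conditions on previously generated components
    base = 20000 + (row["age"] - 18) * 800
    row["income"] = base + random.randint(-5000, 5000)

def gen_label(row):
    # final specialist derives the label from the feature columns
    row["high_earner"] = int(row["income"] > 60000)

TEAM = [gen_age, gen_income, gen_label]  # ordered component generators

def qc_pass(row):
    # one toy QC dimension: range and consistency checks on the finished row
    return 18 <= row["age"] <= 90 and row["income"] > 0

def team_then_trim(n_rows):
    synthetic = []
    while len(synthetic) < n_rows:
        row = {}
        for gen in TEAM:      # "team" stage: build the row sequentially
            gen(row)
        if qc_pass(row):      # "trim" stage: keep only rows passing QC
            synthetic.append(row)
    return synthetic

data = team_then_trim(5)
```

In the full framework the QC stage is a three-stage plug-in pipeline evaluating multiple quality dimensions, not a single predicate; the sketch only conveys the generate-then-filter structure.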