Deep generative models can mitigate data scarcity and privacy concerns by producing synthetic training data, but in low-data, imbalanced tabular settings they struggle to fully learn the complex data distribution. We argue that striving for the full joint distribution may be unnecessary; for greater data efficiency, models should prioritize learning the conditional distribution $P(y\mid \bm{X})$, as suggested by recent theoretical analysis. We therefore address this limitation with \textbf{ReTabSyn}, a \textbf{Re}inforced \textbf{Tab}ular \textbf{Syn}thesis pipeline that provides direct feedback on feature-correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data are limited, thereby strengthening downstream model utility. We fine-tune a language-model-based generator with this approach, and across benchmarks featuring small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach readily extends to controlling other aspects of synthetic tabular data, such as enforcing expert-specified constraints on generated observations.