A common objective in the analysis of tabular data is to estimate the conditional distribution of a set of "outcome" variables given a set of "covariates", rather than only producing predictions; this is sometimes referred to as the "density regression" problem. Beyond estimating the conditional distribution, the ability to draw synthetic samples from the learned conditional distribution is also desirable, as it widens the range of applications. We propose a flow-based generative model tailored to the density regression task on tabular data. Our flow applies a sequence of tree-based piecewise-linear transforms to initial uniform noise to generate samples from complex conditional densities of (univariate or multivariate) outcomes given the covariates, and it allows efficient analytical evaluation of the fitted conditional density at any point in the sample space. We introduce a training algorithm for fitting the tree-based transforms using a divide-and-conquer strategy that turns maximum likelihood training of the tree-flow into training a collection of binary classifiers, one at each tree split, under cross-entropy loss. We assess our method by out-of-sample likelihood and compare it with a variety of state-of-the-art conditional density learners on a range of simulated and real benchmark tabular datasets. Our method consistently achieves comparable or superior performance at a fraction of the training and sampling budget. Finally, we demonstrate the utility of our method's generative ability by training our flow on a publicly available microbiome study and using it to generate synthetic longitudinal microbiome compositional data.
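To make the idea of a tree-based piecewise-linear transform concrete, the following is a minimal sketch of a single such transform. It is not the paper's implementation. It assumes a univariate outcome on [0, 1), a fixed-depth tree that always splits at interval midpoints, and a hypothetical function `split_prob` standing in for a fitted binary classifier at each split; all names and the logistic form are illustrative assumptions.

```python
import math


def split_prob(depth, index, covariate):
    """Hypothetical stand-in for a fitted binary classifier at one tree split.

    Returns P(outcome falls in the left child | covariate). The logistic
    form and its coefficients are arbitrary, chosen only for illustration.
    """
    score = 0.5 * covariate - 0.3 * depth + 0.1 * index
    return 1.0 / (1.0 + math.exp(-score))


def sample_from_noise(u, covariate, max_depth=4):
    """Map uniform noise u in [0, 1) to an outcome in [0, 1).

    At each split the noise is rescaled linearly, so the overall map is
    piecewise linear in u (it is the inverse CDF of the tree density).
    """
    lo, hi, index = 0.0, 1.0, 0
    for depth in range(max_depth):
        p = split_prob(depth, index, covariate)
        mid = 0.5 * (lo + hi)
        if u < p:  # go left and stretch [0, p) back to [0, 1)
            u, hi, index = u / p, mid, 2 * index
        else:      # go right and stretch [p, 1) back to [0, 1)
            u, lo, index = (u - p) / (1.0 - p), mid, 2 * index + 1
    return lo + u * (hi - lo)


def log_density(y, covariate, max_depth=4):
    """Evaluate the log conditional density at y analytically.

    Each split contributes log(2 * branch probability), because the child
    interval has half the width of its parent.
    """
    lo, hi, index = 0.0, 1.0, 0
    total = 0.0
    for depth in range(max_depth):
        p = split_prob(depth, index, covariate)
        mid = 0.5 * (lo + hi)
        if y < mid:
            total += math.log(2.0 * p)
            hi, index = mid, 2 * index
        else:
            total += math.log(2.0 * (1.0 - p))
            lo, index = mid, 2 * index + 1
    return total
```

In this sketch the log-density is a sum over splits of log p or log(1 - p) plus a constant, which is the negative binary cross-entropy of the left/right label at each split. That is why, under these assumptions, maximizing the likelihood decomposes into one classification problem per split.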