Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) that uses gradient-boosted decision trees to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation, and on generation given training data with missingness; it also achieves competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted-tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. Finally, we consider both approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.