DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model

Tabular data plays a crucial role in many domains but often suffers from missing values, which limits its utility. Traditional imputation techniques frequently yield suboptimal results and impose substantial computational costs, leading to inaccuracies in downstream modeling tasks. To address these challenges, we propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM) for tabular imputation. DiffImpute is trained on complete tabular datasets, so that it can produce credible imputations for missing entries without altering the observed data. It can be applied under various Missing Completely At Random (MCAR) and Missing At Random (MAR) settings. To handle tabular features effectively in DDPM, we tailor four tabular denoising networks: MLP, ResNet, Transformer, and U-Net. We also propose Harmonization, which enhances coherence between observed and imputed data by infusing the observed data back and denoising it multiple times during the sampling stage. To enable efficient inference while maintaining imputation performance, we propose a refined non-Markovian sampling process that works together with Harmonization. Empirical evaluations on seven diverse datasets demonstrate the effectiveness of DiffImpute. In particular, when paired with the Transformer as the denoising network, it consistently outperforms competing methods, achieving an average ranking of 1.7 and the lowest standard deviation. In contrast, the next best method has an average ranking of 2.8 and a standard deviation of 0.9. The code is available at https://github.com/Dendiiiii/DiffImpute.
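To make the Harmonization idea more concrete, the following is a minimal NumPy sketch of mask-guided reverse diffusion in which each step is repeated several times: the observed entries are re-noised and infused back, the missing entries are denoised, and the merged sample is diffused forward one step before the next repetition. This is one plausible reading of the description above, not the authors' implementation. The function names (`make_schedule`, `dummy_eps_model`, `impute`), the linear noise schedule, and the parameter `harmonize_steps` are all assumptions made for illustration; the sketch uses plain ancestral DDPM steps rather than the refined non-Markovian sampler, and `dummy_eps_model` is a placeholder for a trained denoising network.

```python
import numpy as np


def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the abstract does not specify one)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def dummy_eps_model(x_t, t):
    """Hypothetical stand-in for a trained denoiser; always predicts zero noise."""
    return np.zeros_like(x_t)


def impute(x_obs, eps_model, T=50, harmonize_steps=3, seed=0):
    """Fill NaN entries of x_obs by masked reverse diffusion with repeated steps."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)

    mask = ~np.isnan(x_obs)          # True where a value is observed
    x_filled = np.nan_to_num(x_obs)  # NaNs replaced by 0 so arithmetic stays finite

    x = rng.standard_normal(x_obs.shape)  # start from pure noise x_T
    for t in range(T - 1, -1, -1):
        reps = harmonize_steps if t > 0 else 1
        for j in range(reps):
            # Observed part: noise the known values to the level of step t-1.
            if t > 0:
                ab_prev = alpha_bars[t - 1]
                x_known = (np.sqrt(ab_prev) * x_filled
                           + np.sqrt(1.0 - ab_prev) * rng.standard_normal(x.shape))
            else:
                x_known = x_filled

            # Missing part: one reverse (denoising) step from x_t to x_{t-1}.
            eps = eps_model(x, t)
            mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            if t > 0:
                x_unknown = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
            else:
                x_unknown = mean

            # Infuse the observed data back into the sample.
            x_prev = np.where(mask, x_known, x_unknown)

            if j < reps - 1:
                # Diffuse x_{t-1} forward to x_t again, then repeat the denoising.
                x = (np.sqrt(alphas[t]) * x_prev
                     + np.sqrt(betas[t]) * rng.standard_normal(x.shape))
            else:
                x = x_prev
    return x


if __name__ == "__main__":
    table = np.array([[1.0, np.nan], [np.nan, 2.0]])
    print(impute(table, dummy_eps_model, T=20, harmonize_steps=2))
```

Because the final step copies the observed entries without added noise, the returned array keeps every observed value unchanged and only the originally missing cells are generated. Setting `harmonize_steps=1` reduces the loop to ordinary masked sampling, which makes the extra cost of the repetitions easy to compare.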
