Diffusion models (DMs) have gained attention in missing data imputation (MDI), but two long-neglected issues remain: (1) inaccurate imputation, which arises from the inherently sample-diversifying generative process of DMs; and (2) difficult training, which stems from the intricate design required for the mask matrix in the model training stage. To address these issues for numerical tabular datasets, we introduce a principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, within the Wasserstein gradient flow (WGF) framework, we first prove that issue (1) arises because the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI objective plus diversification-promoting non-negative terms. Based on this, we design a novel cost functional with a diversification-discouraging negative entropy term and derive KnewImp within the WGF framework and a reproducing kernel Hilbert space. We then prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, which eliminates the need for the mask matrix and hence naturally addresses issue (2). Extensive experiments demonstrate that KnewImp significantly outperforms existing state-of-the-art methods.
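As an illustration only, the idea of a kernelized, negative-entropy-regularized gradient flow can be sketched on a toy problem. The sketch below is not the authors' implementation. It assumes a cost functional of the form E_r[log p(x)] - lam * H(r), an RBF kernel with a fixed bandwidth `h`, and a score function `score` (the gradient of log p) that is simply supplied by the caller, whereas in practice it would have to be estimated from data. The names `rbf_kernel`, `velocity`, and `impute`, the zero initialization, and all hyperparameters are hypothetical choices made for this example. Under these assumptions, smoothing the Wasserstein gradient with the kernel gives a particle update whose kernel-gradient term pulls samples together, the opposite sign of the repulsive term in Stein variational gradient descent.

```python
import numpy as np

def rbf_kernel(X, h):
    """Return K[i, j] = exp(-||x_i - x_j||^2 / (2 h^2)) and diff[i, j] = x_i - x_j."""
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * h ** 2)), diff

def velocity(X, score, h=1.0, lam=1.0):
    """Kernel-smoothed ascent direction for E_r[log p] - lam * H(r) (assumed form)."""
    n = X.shape[0]
    K, diff = rbf_kernel(X, h)
    # Likelihood term: (1/n) * sum_j K[i, j] * score(x_j).
    drive = K @ score(X) / n
    # Negative-entropy term: -lam * grad_{x_j} K(x_i, x_j)
    #   = -lam * K[i, j] * (x_i - x_j) / h^2, which moves x_i toward x_j.
    attract = -lam * np.einsum("ij,ijd->id", K, diff) / (h ** 2 * n)
    return drive + attract

def impute(X_obs, missing, score, steps=200, lr=0.1, h=1.0, lam=1.0):
    """Update only the missing entries; observed entries stay fixed."""
    X = np.where(missing, 0.0, X_obs)  # zero initialization (an assumption)
    for _ in range(steps):
        X = X + lr * missing * velocity(X, score, h, lam)
    return X
```

A call such as `impute(X_obs, missing, lambda x: -x)` treats the data as standard Gaussian, where `missing` is a boolean array marking the entries to fill. The velocity is computed on whole rows and then masked at update time, so the boolean array only selects which coordinates move.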