Incomplete data are common in real-world tabular applications, where numerical, categorical, and discrete attributes coexist within a single dataset. This heterogeneous structure presents significant challenges for existing diffusion-based imputation models, which typically assume a homogeneous feature space and rely on stochastic denoising trajectories. Such assumptions make it difficult to maintain conditional consistency, and they often lead to information collapse for categorical variables or instability when numerical variables require deterministic updates. These limitations indicate that a single diffusion process is insufficient for mixed-type tabular imputation. We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables, while a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data. Extensive experiments on multiple real-world datasets show that the proposed framework achieves higher imputation accuracy, more stable sampling trajectories, and improved robustness across MCAR, MAR, and MNAR settings compared with existing diffusion-based and classical methods. These results demonstrate the importance of structure-aware diffusion processes for advancing deep learning approaches to incomplete tabular data.
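For reference, the deterministic update that a DDIM-based channel builds on can be written as below. This is the standard DDIM step with the stochasticity parameter set to zero. The conditioning on observed entries $x^{\mathrm{obs}}$ and a missingness mask $m$ is notation assumed here for illustration, not a formulation stated in the abstract.

```latex
\[
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta\!\left(x_t, t \mid x^{\mathrm{obs}}, m\right)}{\sqrt{\bar{\alpha}_t}},
\qquad
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta\!\left(x_t, t \mid x^{\mathrm{obs}}, m\right)
\]
```

Here $\bar{\alpha}_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the learned noise predictor. Because no fresh noise is injected at any step, the trajectory from $x_T$ to $x_0$ is fully determined by the initial sample, which is the sense in which the numerical channel's denoising is deterministic.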