Missing-data imputation in large-scale surveys faces two challenges that current tabular diffusion methods do not handle well. First, \emph{structural skips}, cells made inapplicable by questionnaire design, should not be imputed but are often conflated with item nonresponse. Second, \emph{ordinal} responses encode ordered categories, yet most pipelines treat them as nominal levels through one-hot or analog-bit encodings. We introduce \textbf{TabSODA} (\textbf{Tab}ular diffusion with \textbf{S}kip pattern detection and \textbf{O}r\textbf{d}inal \textbf{A}wareness), an Expectation-Maximization (EM)-based diffusion imputer built on the Elucidated Diffusion Model (EDM) framework. TabSODA propagates structural skips through the denoising loss and the reverse-time sampler, and represents ordinal variables with cumulative-probit scalar latents while retaining analog-bit encodings for nominal variables. When a codebook skip mask is available, TabSODA uses it directly; otherwise, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. On the Population Assessment of Tobacco and Health (PATH) Study and the National Survey on Drug Use and Health (NSDUH), two nationally representative U.S.\ surveys, TabSODA reduces ordinal MACE by up to $23.7\%$ and improves categorical accuracy by up to $9\%$ over the strongest baseline across MCAR, MAR, and MNAR masking. The skip miner achieves near-perfect precision on both datasets, allowing TabSODA+SKIP to closely track the codebook-mask variant.
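To make the two central ideas concrete, the following is a minimal sketch and not TabSODA's implementation. It assumes the standard ordered-probit parameterization, in which a scalar latent $z$ and increasing cut points give $P(y \le k) = \Phi(\tau_k - z)$, and it shows a plain squared-error loss that ignores structurally skipped cells. The function names `ordinal_probs` and `masked_sq_error`, the example thresholds, and the unweighted loss are illustrative assumptions; an EDM-style loss would also carry a noise-level weighting, which is omitted here.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ordinal_probs(z, thresholds):
    """Category probabilities under a cumulative-probit model (assumed form).

    `thresholds` are increasing cut points t_1 < ... < t_{K-1}.
    P(y <= k) = Phi(t_k - z), so each category probability is the
    difference of two adjacent cumulative probabilities.
    """
    cdf = [0.0] + [normal_cdf(t - z) for t in thresholds] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]


def masked_sq_error(pred, target, trainable):
    """Mean squared error over cells flagged as trainable (hypothetical helper).

    trainable[i] is False for structural skips, so those cells
    contribute nothing to the loss. Returns 0.0 if no cell is trainable.
    """
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, trainable) if m]
    return sum(terms) / len(terms) if terms else 0.0


# Three ordered categories with illustrative cut points at -1 and 1.
probs = ordinal_probs(0.0, [-1.0, 1.0])

# The middle cell is a structural skip and is excluded from the loss.
loss = masked_sq_error([1.0, 5.0, 2.0], [0.0, 0.0, 0.0], [True, False, True])
```

Because the latent is a single scalar, nearby categories receive similar probability mass, which is the ordering information that a one-hot encoding discards. The mask argument plays the role of either a codebook skip mask or an estimated one.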