In many real-world applications, a proportion of the data is missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) that accounts for missing data in both samples without making assumptions about the missingness mechanism. Our approach derives mathematically precise bounds on the MMD test statistic over all possible values of the missing data. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data when data may be missing arbitrarily. Simulation results show that our method has good statistical power, typically when 5% to 10% of the data are missing. We highlight the value of our approach when the data are missing not at random, a setting in which either ignoring the missing values or using common imputation methods may fail to control the Type I error.
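As an illustration only, the idea of bounding the MMD statistic over all possible missing values can be sketched in Python. This is not the derivation used in the paper: it assumes univariate data, a Gaussian kernel with a fixed bandwidth, and NaN as the missing-value marker, and it relies only on the fact that Gaussian kernel values lie in (0, 1]. The result is therefore a loose interval rather than the precise bounds described above. The names `gaussian_kernel` and `mmd2_interval` are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix for 1-D arrays a and b; every entry lies in (0, 1]."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))


def mmd2_interval(x, y, bandwidth=1.0):
    """Loose (lower, upper) bounds on the biased MMD^2 estimate.

    NaN marks a missing observation (an assumption of this sketch).
    Every kernel term that involves a missing value is only known to
    lie in [0, 1], except diagonal terms k(z, z), which equal 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    x_obs, y_obs = x[~np.isnan(x)], y[~np.isnan(y)]
    n_obs, m_obs = len(x_obs), len(y_obs)

    # Kernel sums over pairs in which both values are observed.
    sxx = gaussian_kernel(x_obs, x_obs, bandwidth).sum()
    syy = gaussian_kernel(y_obs, y_obs, bandwidth).sum()
    sxy = gaussian_kernel(x_obs, y_obs, bandwidth).sum()

    # Diagonal terms of missing points are exactly 1.
    diag_x, diag_y = n - n_obs, m - m_obs
    # Counts of the remaining pairs with at least one missing value.
    unk_xx = n * n - n_obs * n_obs - diag_x
    unk_yy = m * m - m_obs * m_obs - diag_y
    unk_xy = n * m - n_obs * m_obs

    # Lower bound: unknown within-sample terms -> 0, unknown cross terms -> 1.
    lower = ((sxx + diag_x) / n**2 + (syy + diag_y) / m**2
             - 2.0 * (sxy + unk_xy) / (n * m))
    # Upper bound: unknown within-sample terms -> 1, unknown cross terms -> 0.
    upper = ((sxx + diag_x + unk_xx) / n**2 + (syy + diag_y + unk_yy) / m**2
             - 2.0 * sxy / (n * m))
    # The biased MMD^2 estimate is never negative, so clip the lower bound.
    return max(lower, 0.0), upper


# Example call with one missing value in each sample.
x = np.array([0.1, np.nan, 0.5, 1.2])
y = np.array([0.0, 0.7, np.nan])
lower, upper = mmd2_interval(x, y)
```

When neither sample contains missing values, the two bounds coincide with the ordinary biased MMD² estimate. As the proportion of missing values grows, the interval widens.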