We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). We relax the problem of testing MCAR to the problem of testing the compatibility of a sequence of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Tests of compatibility can be used to test the feasibility of positive semidefinite matrix completion problems with noisy observations, and thus our results may be of independent interest. Our first contributions are to define a natural measure of the incompatibility of a sequence of correlation matrices, which can be characterised as the optimal value of a semidefinite programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By studying the concentration properties of the natural plug-in estimator of this measure, we introduce novel hypothesis tests that we prove have power against all distributions with incompatible covariance matrices. The choice of critical values for our tests relies on a new concentration inequality for the Pearson sample correlation matrix, which may be of interest more widely. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy-tailed.
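To make the notion of compatibility concrete, the following is a minimal sketch of a toy case, not the SDP-based incompatibility measure or the tests developed in the paper. It assumes three variables and three missingness patterns in which only the pairs (1,2), (1,3) and (2,3) are jointly observed. Under MCAR, each pairwise correlation must be an entry of one common 3x3 correlation matrix, so the three population correlations are compatible exactly when that matrix is positive semidefinite. The function name `is_psd_3x3_correlation` and the tolerance `tol` are hypothetical choices made for illustration.

```python
def is_psd_3x3_correlation(r12, r13, r23, tol=1e-12):
    """Check whether [[1, r12, r13], [r12, 1, r23], [r13, r23, 1]] is PSD.

    A symmetric matrix is positive semidefinite iff all of its principal
    minors are non-negative. With a unit diagonal, the 1x1 minors equal 1,
    so only the 2x2 minors and the determinant need checking.
    """
    minors_2x2 = [1 - r12**2, 1 - r13**2, 1 - r23**2]
    det = 1 + 2 * r12 * r13 * r23 - r12**2 - r13**2 - r23**2
    return all(m >= -tol for m in minors_2x2) and det >= -tol


# Pairwise correlations that fit inside a single correlation matrix.
compatible = is_psd_3x3_correlation(0.5, 0.5, 0.5)

# Variables 1 and 2 strongly positively correlated, 2 and 3 likewise,
# yet 1 and 3 strongly negatively correlated: no joint distribution
# can produce all three, so these are incompatible.
incompatible = is_psd_3x3_correlation(0.9, -0.9, 0.9)

print(compatible, incompatible)
```

With sample rather than population correlations, an exact check of this kind would reject too often because of estimation noise, which is why a quantitative incompatibility measure together with calibrated critical values is needed.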