We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). We relax the problem of testing MCAR to the problem of testing the compatibility of a collection of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Our first contributions are to define a natural measure of the incompatibility of a collection of correlation matrices, which can be characterised as the optimal value of a semi-definite programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By analysing the concentration properties of the natural plug-in estimator for this measure, we propose a novel hypothesis test, which is calibrated via a bootstrap procedure and has power against any distribution with incompatible covariance matrices. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy-tailed. Furthermore, tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest.
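To make the notion of compatibility concrete, here is a minimal sketch of the simplest special case, not the SDP-based measure or the test proposed in this work. It assumes three variables observed only in pairs, so that each missingness pattern yields one pairwise correlation. The three 2x2 correlation matrices are then compatible exactly when the assembled 3x3 matrix is positive semi-definite. The function name, the tolerance `tol`, and the numerical values are illustrative assumptions.

```python
import numpy as np

def pairwise_correlations_compatible(rho12, rho23, rho13, tol=1e-10):
    """Check whether three pairwise correlations can arise from a single
    3x3 correlation matrix (hypothetical helper, for illustration only)."""
    # Assemble the only candidate joint correlation matrix.
    full = np.array([
        [1.0, rho12, rho13],
        [rho12, 1.0, rho23],
        [rho13, rho23, 1.0],
    ])
    # Compatible iff the candidate is positive semi-definite,
    # i.e. its smallest eigenvalue is non-negative (up to tolerance).
    return bool(np.linalg.eigvalsh(full).min() >= -tol)

# Moderate positive correlations fit together in one joint distribution.
print(pairwise_correlations_compatible(0.5, 0.5, 0.5))
# Strong positive (1,2) and (2,3) correlations cannot coexist with a
# strong negative (1,3) correlation, so these margins are incompatible.
print(pairwise_correlations_compatible(0.9, 0.9, -0.9))
```

In this fully observed pairwise setting the check reduces to a single eigenvalue computation. With general missingness patterns some entries of the joint matrix are never observed, and deciding compatibility becomes a positive semi-definite completion question of the kind the abstract formulates as an SDP.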