Our capacity to process large, complex data sources is ever-increasing, raising new and important applied research questions, such as how to handle missing values in large-scale databases. Mitra et al. (2023) described the phenomenon of Structured Missingness (SM), in which missingness has an underlying structure. Existing taxonomies of missingness mechanisms typically assume that the variables' missingness indicator vectors $M_1, M_2, \ldots, M_p$ are independent after conditioning on the relevant portion of the data matrix $\mathbf{X}$. As this assumption is often unsuitable for characterising SM in multivariate settings, we introduce a taxonomy for SM in which each $M_j$ can depend on $\mathbf{M}_{-j}$ (i.e., all missingness indicator vectors except $M_j$), in addition to $\mathbf{X}$. We embed this new framework within the well-established decomposition of mechanisms into MCAR, MAR, and MNAR (Rubin, 1976), which allows us to recast mechanisms in a broader setting where we can consider the combined effect of $\mathbf{X}$ and $\mathbf{M}_{-j}$ on $M_j$. We also demonstrate, via simulations, the impact of SM on inference and prediction, and consider contextual instances of SM arising in a de-identified nationwide (US-based) clinico-genomic database (CGDB). We hope to stimulate interest in SM and to encourage timely research into this phenomenon.
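To make the distinction between independent and structured missingness indicators concrete, the following is a minimal sketch in Python and is not taken from the paper's simulations. It assumes just two variables and hypothetical probabilities (0.3, 0.8, and 0.1), and it ignores any dependence on $\mathbf{X}$. In the structured case, $M_2$ depends on $M_1$; in the unstructured case, $M_2$ is drawn independently with the same marginal missingness rate.

```python
import numpy as np


def simulate_indicators(n, structured, seed=0):
    """Draw two missingness indicator vectors (True = value is missing).

    All probabilities here are hypothetical and chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    # M_1 is missing completely at random with probability 0.3.
    m1 = rng.random(n) < 0.3
    if structured:
        # Structured case: M_2 is far more likely to be missing when M_1 is.
        p2 = np.where(m1, 0.8, 0.1)
    else:
        # Unstructured case: same marginal rate for M_2 (0.31), but no
        # dependence on M_1.
        p2 = np.full(n, 0.3 * 0.8 + 0.7 * 0.1)
    m2 = rng.random(n) < p2
    return m1, m2


def indicator_correlation(m1, m2):
    """Pearson correlation between two boolean indicator vectors."""
    return float(np.corrcoef(m1.astype(float), m2.astype(float))[0, 1])


m1_s, m2_s = simulate_indicators(50_000, structured=True)
m1_u, m2_u = simulate_indicators(50_000, structured=False)

corr_structured = indicator_correlation(m1_s, m2_s)
corr_unstructured = indicator_correlation(m1_u, m2_u)

print(f"structured:   corr(M_1, M_2) = {corr_structured:.3f}")
print(f"unstructured: corr(M_1, M_2) = {corr_unstructured:.3f}")
```

Both settings have the same per-variable missingness rates, so summaries computed one variable at a time would not tell them apart. Only the joint behaviour of the indicators reveals the structure, which is the kind of dependence on $\mathbf{M}_{-j}$ that the taxonomy is meant to capture.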