Missing data is pervasive in many scientific domains, such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and the underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution, as well as graphical conditions for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference and when finer-grained modeling is necessary.