Causal discovery (CD) aims to recover the causal graph underlying the data-generating mechanism of observed variables. In many real-world applications the observed variables are vector-valued; in climate science, for example, variables are defined over a spatial grid and the task is called spatio-temporal causal discovery. We motivate CD in the vector-valued setting while considering different possibilities for the underlying model, and highlight the pitfalls of commonly used approaches compared to a fully vectorized approach. Furthermore, vector-valued variables are often high-dimensional, and aggregations of the variables, such as averages, are considered in the interest of efficiency and robustness. In the absence of interventional data, it is hard to test whether aggregate variables are sound as consistent abstractions that map a low-level structural causal model (SCM) to a high-level one, and recent works have illustrated how stringent the conditions required for testing consistency are. In this work, we take a careful look at vector-valued CD via constraint-based methods, focusing on the consistency of aggregation for this task. We derive three aggregation consistency scores, based on the compatibility of independence models and (partial) aggregation, that quantify different aspects of the soundness of an aggregation map for the CD problem. We argue that the consistency of causal abstractions must be separated from the task-dependent consistency of aggregation maps. As an actionable conclusion of our findings, we propose a wrapper, Adag, that optimizes a chosen aggregation consistency score for aggregate-CD, making the output of CD over aggregate variables more reliable. We supplement all our findings with experimental evaluations on synthetic non-time-series and spatio-temporal data.
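To make the aggregation pitfall concrete, the following is a minimal toy sketch and not the consistency scores or the Adag wrapper from this work. It assumes a two-component vector variable X whose components drive a scalar Y with opposite signs, so that averaging X cancels the dependence. The function names `make_data` and `max_abs_cross_corr`, the noise scale, and the use of plain marginal correlation in place of a proper independence test are all illustrative assumptions.

```python
import numpy as np

def make_data(n=5000, seed=0):
    """Toy model (assumed for illustration): Y = X_1 - X_2 + small noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                      # vector-valued cause X
    y = (x[:, 0] - x[:, 1])[:, None] + 0.1 * rng.normal(size=(n, 1))
    return x, y

def max_abs_cross_corr(a, b):
    """Largest absolute sample correlation between any column of a and any column of b."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    cross = a.T @ b / len(a)                         # cross-correlation matrix
    return float(np.abs(cross).max())

x, y = make_data()

# Dependence measured on the full vector: each component of X correlates with Y.
vector_dep = max_abs_cross_corr(x, y)

# Dependence measured after averaging X: the opposite-signed effects cancel.
x_mean = x.mean(axis=1, keepdims=True)
aggregate_dep = max_abs_cross_corr(x_mean, y)

print(f"vector-level dependence:    {vector_dep:.3f}")
print(f"aggregate-level dependence: {aggregate_dep:.3f}")
```

In this toy setting the vector-level measure is clearly non-zero while the aggregate-level measure is close to zero, so a constraint-based method run on the averaged variable would tend to drop an edge that exists at the vector level. This is one way an aggregation map can be unsound for the CD task.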