A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series

Causal inference is a fundamental research topic for discovering cause-effect relationships in many disciplines. However, not all algorithms are equally well suited to a given dataset. For instance, some approaches can identify only linear relationships, while others can also handle non-linear ones. Algorithms further vary in their sensitivity to noise and in their ability to infer causal information from coupled versus non-coupled time series. As a result, different algorithms often produce different causal relationships for the same input. To achieve a more robust causal inference result, this paper proposes a novel data-driven two-phase multi-split causal ensemble model that combines the strengths of different causality base algorithms. In contrast to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase: the data are divided into several partitions and the base algorithms are applied to each partition. Gaussian mixture models are then used to identify which of the causal relationships derived from the different partitions are likely to be valid. In the second phase, the relationships identified by each base algorithm are merged according to three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments on three synthetic datasets of different volume and complexity, each specifically designed to test causality detection methods under different circumstances with known ground-truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the proposed method could therefore lead to more robust and reliable causality results.
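The two-phase idea can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration and not the paper's implementation: the function names (split, lagged_corr, sign_corr, fit_gmm2, phase_one, merge) are hypothetical, the two "base algorithms" are simple lagged-correlation scores standing in for real causality algorithms, the rule for reading the mixture model (keep an edge whose mean score is more likely under the high-mean component) is one plausible choice, and the second phase uses a plain vote because the abstract does not spell out the three combination rules.

```python
import math
import random

def split(series, k):
    """Cut every series into k contiguous, equally long partitions."""
    size = len(next(iter(series.values()))) // k
    return [{n: v[i * size:(i + 1) * size] for n, v in series.items()} for i in range(k)]

def lagged_corr(x, y, lag=1):
    """Stand-in base score (assumption): |Pearson correlation| of x[t] with y[t + lag]."""
    a, b = x[:-lag], y[lag:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return abs(cov) / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def sign_corr(x, y):
    """Second stand-in (assumption): the same score computed on the signs of the values."""
    return lagged_corr([math.copysign(1.0, u) for u in x],
                       [math.copysign(1.0, v) for v in y])

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm2(values, iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM; returns weights, means, variances."""
    lo, hi = min(values), max(values)
    w, mu = [0.5, 0.5], [lo, hi]
    var = [max((hi - lo) ** 2 / 4, 1e-6)] * 2
    for _ in range(iters):
        resp = []
        for v in values:  # E-step: responsibility of each component for v
            p = [w[j] * normal_pdf(v, mu[j], var[j]) for j in range(2)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        for j in range(2):  # M-step: re-estimate weight, mean, variance
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue
            w[j] = nj / len(values)
            mu[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            var[j] = max(sum(r[j] * (v - mu[j]) ** 2 for r, v in zip(resp, values)) / nj, 1e-6)
    return w, mu, var

def phase_one(series, score, k=5):
    """Score every ordered pair on each partition, then keep the pairs whose
    mean score is more likely under the high-mean mixture component."""
    names = list(series)
    parts = split(series, k)
    edges = [(a, b) for a in names for b in names if a != b]
    scores = {e: [score(p[e[0]], p[e[1]]) for p in parts] for e in edges}
    w, mu, var = fit_gmm2([s for vals in scores.values() for s in vals])
    high = 0 if mu[0] > mu[1] else 1
    kept = set()
    for e, vals in scores.items():
        m = sum(vals) / len(vals)
        p = [w[j] * normal_pdf(m, mu[j], var[j]) for j in range(2)]
        if p[high] > p[1 - high]:
            kept.add(e)
    return kept

def merge(edge_sets, min_votes):
    """Phase 2 stand-in (assumption): keep edges found by at least min_votes base algorithms."""
    votes = {}
    for edges in edge_sets:
        for e in edges:
            votes[e] = votes.get(e, 0) + 1
    return {e for e, n in votes.items() if n >= min_votes}

# Toy data: x drives y with a lag of one step; z is independent noise.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
z = [random.gauss(0, 1) for _ in range(1000)]
y = [0.0] + [0.9 * x[t - 1] + 0.1 * random.gauss(0, 1) for t in range(1, 1000)]
series = {"x": x, "y": y, "z": z}
merged = merge([phase_one(series, lagged_corr), phase_one(series, sign_corr)], min_votes=2)
print(sorted(merged))
```

In this toy setup the scores for the true edge cluster far from the noise scores, so the mixture model separates them easily. Real base algorithms and real data would need more careful handling, for example of cases where no edge is valid and both mixture components describe noise.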
