Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods with the best overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardization as a preprocessing step in practice: it performs poorly in the vanilla setting but exhibits strong performance once the data are standardized. Overall, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The code and datasets are available at https://github.com/huiyang-yi/CausalCompass.
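To make the standardization finding concrete, the following is a minimal sketch of per-variable z-score standardization applied to a multivariate time series before it is passed to a causal discovery method. It assumes the data are stored as a NumPy array of shape (T, d), with T time steps and d variables. The function name `standardize` and the `eps` guard are illustrative assumptions and are not taken from the CausalCompass code base, which may implement this step differently.

```python
import numpy as np


def standardize(series: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Z-score each variable (column) of a (T, d) multivariate time series.

    Hypothetical helper for illustration only; not the CausalCompass API.
    """
    mean = series.mean(axis=0, keepdims=True)
    std = series.std(axis=0, keepdims=True)
    # Guard against division by zero for constant variables.
    return (series - mean) / np.maximum(std, eps)


# Synthetic example: three variables on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(500, 3))
z = standardize(raw)
```

After this step every variable has approximately zero mean and unit variance, so methods that are sensitive to the scale of the inputs see all variables on a comparable footing.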