CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark framework for assessing the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate its practical utility, we extensively benchmark representative TSCD algorithms across eight assumption-violation scenarios. Our results indicate that no single method consistently attains the best performance across all settings. Nevertheless, the methods with the strongest overall performance across scenarios are almost invariably deep learning-based. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings, and ablation experiments to explain the strong performance of deep learning-based methods under assumption violations. Somewhat surprisingly, we also find that NTS-NOTEARS relies heavily on standardization as a preprocessing step in practice: it performs poorly in the vanilla setting but strongly after standardization. Overall, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The user-friendly implementation, documentation, and datasets are available at https://anonymous.4open.science/r/CausalCompass-anonymous-5B4F/.
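
To make the standardization finding concrete, the following is a minimal sketch of per-variable z-score standardization applied to a multivariate time series. The abstract does not specify the exact preprocessing used, so per-variable z-scoring is an assumption here, and the function name `standardize` and the toy data are hypothetical illustrations rather than part of CausalCompass.

```python
from statistics import fmean, pstdev


def standardize(series):
    """Z-score each variable (column) of a multivariate time series.

    `series` is a list of time steps; each time step is a list holding
    one value per variable. Assumption: standardization means subtracting
    the per-variable mean and dividing by the per-variable population
    standard deviation.
    """
    columns = list(zip(*series))  # one tuple of values per variable
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    # Map constant variables to 0.0 instead of dividing by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in series
    ]


# Hypothetical toy data: two variables on very different scales.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize(raw)
```

After this transformation every variable has zero mean and unit variance, so a method whose behavior depends on the scale of its inputs sees all variables on a common footing.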
