Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets rely either on real-world observations, which lack ground-truth counterfactuals, or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports both static and time-varying treatments and both single-policy and multi-policy interventions, enabling evaluation of causal inference methods across a broad range of scenarios. Using a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories for more than 150 U.S. counties. With this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences among them and highlighting the challenges of realistic time-series causal reasoning.