Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence-matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave in a manner consistent with a context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
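The noise-titration idea and the SPD parameterization can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the Lorenz-63 system stands in for a "known chaotic system", the noise levels are arbitrary, and `spd_from_unconstrained` uses a standard Cholesky factor with an exponentiated diagonal, which is one common way to stay inside the SPD cone and may differ from how Fern does it. Because the noise variance is known, the oracle forecaster's negative log-likelihood equals the Gaussian entropy in expectation, which gives an exact floor against which any model's NLL could be compared.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One RK4 step of Lorenz-63 (assumed stand-in for a known chaotic system)."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(n_steps, x0=(1.0, 1.0, 1.0)):
    """Noise-free trajectory of shape (n_steps, 3)."""
    traj = np.empty((n_steps, 3))
    s = np.array(x0, dtype=float)
    for t in range(n_steps):
        s = lorenz_step(s)
        traj[t] = s
    return traj

def spd_from_unconstrained(theta, dim):
    """Map dim*(dim+1)/2 unconstrained reals to an SPD matrix.

    Hypothetical helper: fills a lower-triangular factor, exponentiates its
    diagonal so it is strictly positive, and returns L @ L.T.
    """
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = theta
    diag = np.diag_indices(dim)
    L[diag] = np.exp(L[diag])
    return L @ L.T

def gaussian_nll(y, mean, cov):
    """Mean negative log-likelihood of rows of y under N(mean, cov)."""
    dim = y.shape[-1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (y - mean).T)   # whitened residuals, (dim, n)
    maha = np.sum(z ** 2, axis=0)             # squared Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return float(np.mean(0.5 * (maha + logdet + dim * np.log(2.0 * np.pi))))

rng = np.random.default_rng(0)
clean = simulate(2000)
dim = clean.shape[1]

results = {}
for noise_std in (0.1, 0.5, 2.0):             # illustrative titration levels
    noisy = clean + noise_std * rng.standard_normal(clean.shape)
    true_cov = noise_std ** 2 * np.eye(dim)
    results[noise_std] = {
        # Oracle: knows the clean state and the exact noise covariance.
        "oracle_nll": gaussian_nll(noisy, clean, true_cov),
        # Same mean, but a covariance that is too narrow by a factor of 4.
        "overconfident_nll": gaussian_nll(noisy, clean, 0.25 * true_cov),
        # Gaussian differential entropy: the expected oracle NLL.
        "entropy_floor": 0.5 * dim * (1.0 + np.log(2.0 * np.pi * noise_std ** 2)),
    }
```

In this sketch the oracle NLL tracks the entropy floor at every noise level, while the overconfident covariance is penalized, which is the kind of exact calibration check that a single historical trajectory with unknown noise cannot provide.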