Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that remove the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting, where accuracy, calibration, and robustness directly affect grid operations, remains an open question. We present a multi-dimensional benchmark that evaluates four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16 GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts, including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction relative to the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models: Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at the 90% nominal level), while both Moirai-2 and Prophet are overconfident (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
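For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how MASE and empirical interval coverage are commonly computed. It is not taken from the released benchmark framework: the function names, the `y_insample` argument, and the choice of a 24-hour seasonal period for the scaling term are assumptions made for illustration, and the benchmark may define the scaling window differently.

```python
from typing import Sequence


def mase(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    y_insample: Sequence[float],
    season: int = 24,  # assumed daily seasonality for hourly load
) -> float:
    """Forecast MAE divided by the in-sample MAE of a seasonal naive forecast."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if len(y_insample) <= season:
        raise ValueError("y_insample must be longer than one seasonal period")

    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    # Seasonal naive error on the history: each value vs. the value one season earlier.
    naive_errors = [
        abs(y_insample[i] - y_insample[i - season])
        for i in range(season, len(y_insample))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    return forecast_mae / scale


def empirical_coverage(
    y_true: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Fraction of observations that fall inside [lower, upper]."""
    if not (len(y_true) == len(lower) == len(upper)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and equal in length")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


if __name__ == "__main__":
    # Toy numbers with a seasonal period of 2, purely for illustration.
    history = [10, 20, 12, 22, 11, 21]
    actual = [13, 23]
    forecast = [12, 22.5]
    print(mase(actual, forecast, history, season=2))
    print(empirical_coverage(actual, lower=[12, 24], upper=[14, 26]))
```

Under this definition, a MASE below 1 means the forecast beats the in-sample seasonal naive error on average, and a 90% nominal interval is well calibrated when its empirical coverage is close to 0.90; a value near 0.70 indicates intervals that are too narrow.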