Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark

Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting, where accuracy, calibration, and robustness directly affect grid operations, remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16 GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048 h, day-ahead horizon), a 47% reduction relative to the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models: Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at the 90% nominal level), while both Moirai-2 and Prophet are overconfident (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
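The two headline metrics above, MASE and empirical interval coverage, can be illustrated with a minimal Python sketch. It is not taken from the released benchmark framework: the function names, the 24-hour seasonal period, and the choice to scale by the in-context seasonal-naive error are all assumptions made for illustration, and the benchmark may define the scaling term differently.

```python
import numpy as np


def mase(y_true, y_pred, y_context, season=24):
    """Mean Absolute Scaled Error (hypothetical helper, not the benchmark's code).

    Assumption: the forecast MAE is scaled by the MAE of a seasonal-naive
    forecast (y[t] = y[t - season]) computed on the context window.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_context = np.asarray(y_context, dtype=float)
    if y_context.size <= season:
        raise ValueError("context must be longer than one seasonal period")
    # Seasonal-naive error inside the context window.
    scale = np.mean(np.abs(y_context[season:] - y_context[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)


def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))


if __name__ == "__main__":
    # Synthetic hourly "load" with a daily cycle; purely illustrative data.
    rng = np.random.default_rng(0)
    hours = np.arange(24 * 8)
    load = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
    context, target = load[:-24], load[-24:]

    forecast = context[-24:]  # seasonal-naive day-ahead forecast
    print("MASE:", mase(target, forecast, context))
    print("Coverage:", empirical_coverage(target, forecast - 3, forecast + 3))
```

Under this definition, a MASE below 1 means the forecast beats the seasonal-naive error measured on the context window. Coverage is read against the nominal level: a 90% interval that contains about 70% of observations is overconfident, while one that contains about 95% is slightly conservative.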
