Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting

Recent advances in time-series forecasting increasingly rely on pre-trained, foundation-style models. While these models often claim broad generalization, existing evaluation protocols provide limited evidence for it. Most current benchmarks use static train-test splits, which invite contamination: foundation models can inadvertently train on test data, or model selection can be performed using test scores, and either can inflate reported performance. We introduce Impermanent, a live benchmark that evaluates forecasting models under open-world temporal change by scoring forecasts sequentially over time on continuously updated data streams. This enables the study of temporal robustness, distributional shift, and performance stability rather than one-off accuracy on a frozen test set. Impermanent is instantiated on GitHub open-source activity, a naturally live and highly non-stationary data source shaped by releases, shifting contributor behavior, platform and tooling changes, and external events. We focus on the top 400 repositories by star count and construct time series from issues opened, pull requests opened, push events, and new stargazers. Forecasts are evaluated over a rolling window with daily updates, alongside standardized protocols and leaderboards for reproducible, ongoing comparison. By shifting evaluation from static accuracy to sustained performance, Impermanent takes a concrete step toward assessing whether and when foundation-level generalization in time-series forecasting can be meaningfully claimed. Code and a live dashboard are available at https://github.com/TimeCopilot/impermanent and https://impermanent.timecopilot.dev.
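
To make the idea of scoring forecasts sequentially over time concrete, the following is a minimal sketch of rolling-origin evaluation on a single daily series. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the function names, the seasonal-naive baseline, the weekly season length, and the use of mean absolute error are all hypothetical choices that the abstract does not specify.

```python
from statistics import mean


def seasonal_naive(history, horizon, season=7):
    """Hypothetical baseline: repeat the last observed seasonal cycle."""
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_cycle = history[-season:]
    return [last_cycle[h % season] for h in range(horizon)]


def rolling_origin_scores(series, first_cutoff, horizon, step=1, season=7):
    """Score forecasts at successive cutoffs, using only data before each cutoff.

    Returns a list of (cutoff_index, mean_absolute_error) pairs, one per
    cutoff for which a full horizon of actual values is available.
    """
    scores = []
    cutoff = first_cutoff
    while cutoff + horizon <= len(series):
        history = series[:cutoff]                  # data visible at forecast time
        actual = series[cutoff:cutoff + horizon]   # values revealed afterwards
        forecast = seasonal_naive(history, horizon, season)
        mae = mean(abs(a - f) for a, f in zip(actual, forecast))
        scores.append((cutoff, mae))
        cutoff += step                             # advance the forecast origin
    return scores


if __name__ == "__main__":
    # Synthetic daily counts (assumed data): two stable weeks, then a level shift.
    daily_counts = [1] * 14 + [3] * 14
    for cutoff, mae in rolling_origin_scores(daily_counts, 14, 7, step=7):
        print(cutoff, mae)
```

In a live setting the series would keep growing, so each new day adds a cutoff whose actual values did not exist when the forecast was made. That property is what guards against test-set contamination. Tracking the error per cutoff, rather than as a single average, is what exposes degradation under distribution shift.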
