Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting that comparable performance can be achieved without them. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) spanning 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emph{LLM4TSF indeed improves forecasting performance}, with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90\% of tasks. The pretrained knowledge and the architecture of LLMs both contribute and play complementary roles: pretraining is critical under distribution shifts, whereas the architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, our findings overturn prior negative assessments, establish clear conditions under which LLMs are genuinely useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT-NLP/LLM4TSF.