Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the models' representational capacity, by directly comparing prompt outputs with linear probes trained on the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
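The probing setup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in practice the features would be hidden states pooled from one transformer layer (e.g. via `output_hidden_states=True` in Hugging Face Transformers); here synthetic class-dependent features stand in for them, and all dimensions and class counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim, n_classes = 600, 64, 3  # assumed toy sizes

# Synthetic stand-in for frozen LLM hidden states: each time series class
# maps to a different mean direction in representation space, plus noise.
y = rng.integers(0, n_classes, size=n_samples)
class_means = rng.normal(size=(n_classes, hidden_dim))
X = class_means[y] + rng.normal(scale=1.0, size=(n_samples, hidden_dim))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The linear probe: a single logistic-regression layer trained on the
# frozen representations; the base model's weights are never updated.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"macro F1: {macro_f1:.2f}")
```

Repeating this fit per layer, with features pooled from each transformer layer in turn, yields the kind of layer-wise F1 profile the abstract refers to.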