Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap, we introduce MMTS-BENCH, a comprehensive multimodal benchmark built on a hierarchical taxonomy of time series tasks spanning structural awareness, feature analysis, temporal reasoning, sequence matching, and cross-modal alignment. MMTS-BENCH comprises 2,424 time series question answering (TSQA) pairs across four subsets (Base, InWild, Match, and Align), generated through a progressive real-world QA framework and modular synthetic data construction. We conduct extensive evaluations of closed-source LLMs, open-source LLMs, and existing time-series-adapted LLMs (TS-LLMs), revealing that: (1) TS-LLMs significantly lag behind general-purpose LLMs in cross-domain generalization, (2) LLMs perform worse on local tasks than on global tasks, (3) chain-of-thought (CoT) reasoning and multimodal integration substantially improve performance, and (4) the dominant factor in the performance of existing TS-LLMs remains the capability of the backbone network rather than the design of the time series encoder. MMTS-BENCH not only provides a rigorous evaluation framework but also offers clear directions for advancing LLMs toward robust, interpretable, and generalizable time series reasoning.