CaTS-Bench: Can Language Models Describe Time Series?

Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce \textbf{CaTS-Bench}, a comprehensive benchmark for \textbf{C}ontext-\textbf{a}ware \textbf{T}ime \textbf{S}eries reasoning across $11$ diverse domains, centered on a gold-standard evaluation set of $1746$ human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of $910$ multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
