Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and use tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal text generation in numeric domains.