Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six tasks under a single framework, ranging from conventional analysis (anomaly detection and classification) to advanced analysis (characterization, comparison, data transformation, and temporal relationship analysis). Spanning 210k samples across 13 domains, the dataset employs diverse question formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, even the best-performing open-source model, LLaMA-3.1-8B, still shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.