MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering

Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, both of which are essential for capturing the interactions between narrative information and temporal patterns. To bridge this gap, we introduce the Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time-series and text understanding across the financial and weather domains. MTBench comprises paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTBench provides a comprehensive testbed for models to reason jointly over structured numerical trends and unstructured textual narratives. The richness of MTBench enables the formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target a model's ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTBench, analyzing how effectively they model the relationships between news narratives and temporal patterns. Our findings reveal significant challenges for current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.
