High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to capture complex temporal patterns and align poorly with intuitive human preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we use Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to interpret time series plots grounded in textual information. Specifically, we propose a novel framework that integrates micro- and macro-level judgments, informed by contextual information, to evaluate forecasts. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5,563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building on this benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results show that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.