LLM-as-a-Judge for Time Series Explanations

Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models can generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. As a result, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting: given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models on four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. The results show a clear asymmetry. Generation is highly pattern dependent and fails systematically on certain query types: accuracy ranges from 0.00 to 0.12 on Seasonal Drop and Volatility Shift but from 0.94 to 0.96 on Structural Break. Evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.
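
The reference-free evaluator setting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `JudgeCase`, `build_prompt`, `parse_label`, and `judge` are hypothetical, the prompt wording is an assumption, and the model is passed in as a plain callable so that no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# The three labels of the ternary correctness scale named in the abstract.
LABELS = ("correct", "partially correct", "incorrect")


@dataclass
class JudgeCase:
    """One evaluation input: a series, a question, and a candidate explanation."""
    series: Sequence[float]
    question: str
    explanation: str


def build_prompt(case: JudgeCase) -> str:
    """Render a judge prompt (hypothetical wording) with no reference explanation."""
    values = ", ".join(f"{v:.2f}" for v in case.series)
    return (
        "You are evaluating an explanation of a time series.\n"
        f"Series: [{values}]\n"
        f"Question: {case.question}\n"
        f"Explanation: {case.explanation}\n"
        "Check pattern identification, numeric accuracy, and answer faithfulness.\n"
        "Reply with exactly one label: correct, partially correct, or incorrect."
    )


def parse_label(reply: str) -> str:
    """Map a free-text reply onto one of the three labels."""
    text = reply.strip().lower()
    # "correct" is a substring of the other two labels, so test it last.
    for label in ("partially correct", "incorrect", "correct"):
        if label in text:
            return label
    raise ValueError(f"no label found in reply: {reply!r}")


def judge(case: JudgeCase, llm: Callable[[str], str]) -> str:
    """Score one candidate explanation; `llm` is any text-in, text-out callable."""
    return parse_label(llm(build_prompt(case)))
```

In use, `llm` would wrap whichever model is being evaluated. The substring matching in `parse_label` is deliberately naive (a reply such as "not correct" would be read as "correct"), so a stricter output format would be preferable in practice.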
