Time series anomaly detection (TSAD) is essential for ensuring the safety and reliability of aerospace software systems. Although large language models (LLMs) provide a promising training-free alternative to unsupervised approaches, their effectiveness in aerospace settings remains under-examined because of complex telemetry, misaligned evaluation metrics, and the absence of domain knowledge. To address this gap, we introduce ATSADBench, the first benchmark for aerospace TSAD. ATSADBench comprises nine tasks that combine three pattern-wise anomaly types, univariate and multivariate signals, and both in-loop and out-of-loop feedback scenarios, yielding 108,000 data points. Using this benchmark, we systematically evaluate state-of-the-art open-source LLMs under two paradigms: Direct, which labels anomalies within sliding windows, and Prediction-Based, which detects anomalies from prediction errors. To reflect operational needs, we reformulate evaluation at the window level and propose three user-oriented metrics: Alarm Accuracy (AA), Alarm Latency (AL), and Alarm Contiguity (AC), which quantify alarm correctness, timeliness, and credibility. We further examine two enhancement strategies, few-shot learning and retrieval-augmented generation (RAG), to inject domain knowledge. The evaluation results show that (1) LLMs perform well on univariate tasks but struggle with multivariate telemetry, (2) their AA and AC on multivariate tasks approach random guessing, (3) few-shot learning provides modest gains whereas RAG offers no significant improvement, and (4) in practice LLMs can detect true anomaly onsets yet sometimes raise false alarms, which few-shot prompting mitigates but RAG exacerbates. These findings offer guidance for future LLM-based TSAD in aerospace software.
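To make the window-level reformulation concrete, below is a minimal sketch of how the three proposed metrics might be computed from boolean per-window labels. The function names and exact definitions here are illustrative assumptions, not the paper's official formulas: accuracy is read as per-window agreement, latency as windows elapsed from anomaly onset to the first subsequent alarm, and contiguity as the longest unbroken alarm run relative to all alarmed windows.

```python
# Illustrative sketch (assumed definitions, not the paper's official ones):
# window-level alarm metrics over boolean per-window labels.

def alarm_accuracy(truth, alarms):
    """Fraction of windows whose alarm decision matches the ground truth."""
    assert len(truth) == len(alarms)
    return sum(t == a for t, a in zip(truth, alarms)) / len(truth)

def alarm_latency(truth, alarms):
    """Windows elapsed between the first anomalous window and the first
    alarm raised at or after it; None if no alarm follows the onset."""
    try:
        onset = truth.index(True)
    except ValueError:
        return None
    for i in range(onset, len(alarms)):
        if alarms[i]:
            return i - onset
    return None

def alarm_contiguity(alarms):
    """Longest run of consecutive alarms divided by the total number of
    alarmed windows (1.0 means a single unbroken, credible alarm)."""
    total = sum(alarms)
    if total == 0:
        return 0.0
    best = run = 0
    for a in alarms:
        run = run + 1 if a else 0
        best = max(best, run)
    return best / total

truth  = [False, False, True, True, True, False]
alarms = [False, False, False, True, True, True]
print(alarm_accuracy(truth, alarms))   # 4 of 6 windows agree
print(alarm_latency(truth, alarms))    # alarm fires 1 window after onset
print(alarm_contiguity(alarms))        # all 3 alarms form one run -> 1.0
```

Under this reading, a detector that fires in scattered single windows scores low on contiguity even if its accuracy is reasonable, which matches the abstract's use of contiguity as a proxy for alarm credibility.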