Time series anomaly detection (TSAD) is essential for ensuring the safety and reliability of aerospace software systems. Although large language models (LLMs) provide a promising training-free alternative to unsupervised approaches, their effectiveness in aerospace settings remains under-examined because of complex telemetry, misaligned evaluation metrics, and the absence of domain knowledge. To address this gap, we introduce ATSADBench, the first benchmark for aerospace TSAD. ATSADBench comprises nine tasks that combine three pattern-wise anomaly types, univariate and multivariate signals, and both in-loop and out-of-loop feedback scenarios, yielding 108,000 data points. Using this benchmark, we systematically evaluate state-of-the-art open-source LLMs under two paradigms: Direct, which labels anomalies within sliding windows, and Prediction-Based, which detects anomalies from prediction errors. To reflect operational needs, we reformulate evaluation at the window level and propose three user-oriented metrics: Alarm Accuracy (AA), Alarm Latency (AL), and Alarm Contiguity (AC), which quantify alarm correctness, timeliness, and credibility. We further examine two enhancement strategies, few-shot learning and retrieval-augmented generation (RAG), to inject domain knowledge. The evaluation results show that (1) LLMs perform well on univariate tasks but struggle with multivariate telemetry, (2) their AA and AC on multivariate tasks approach random guessing, (3) few-shot learning provides modest gains whereas RAG offers no significant improvement, and (4) in practice LLMs can detect true anomaly onsets yet sometimes raise false alarms, which few-shot prompting mitigates but RAG exacerbates. These findings offer guidance for future LLM-based TSAD in aerospace software.