System logs play a critical role in maintaining the reliability of software systems. Numerous studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions are limited by high resource consumption and poor adaptability to evolving logs. In this paper, we present SeaLog, an accurate, lightweight, and adaptive log-based anomaly detection framework. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically growing trie structure for real-time anomaly detection. To keep TDA accurate as log data evolves, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to that of human experts, which can potentially reduce manual verification effort. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in effectiveness, runs 2x to 10x faster, and consumes only 5% to 41% of the memory used by the baselines.
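To make the idea of a dynamically growing trie with expert feedback concrete, the following is a minimal illustrative sketch, not SeaLog's actual TDA. The abstract does not specify the tokenization, the detection rule, or the feedback interface, so everything here is an assumption: the names `TrieNode`, `TrieDetector`, `observe`, and `feedback` are hypothetical, messages are split on whitespace, and a message is flagged when its token path is seen for the first time and has not yet been reviewed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    """One token position in the trie (hypothetical structure)."""
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    count: int = 0                # number of observed messages ending at this node
    label: Optional[bool] = None  # feedback: True = anomalous, False = normal, None = unreviewed


class TrieDetector:
    """Toy detector: the trie grows as new token paths arrive (assumed rule set)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def _walk(self, tokens: List[str]) -> TrieNode:
        # Follow the token path, creating missing nodes so the trie grows on demand.
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = TrieNode()
            node = node.children[tok]
        return node

    def observe(self, message: str) -> bool:
        """Record a log message and return True if it is flagged as anomalous."""
        node = self._walk(message.split())
        node.count += 1
        if node.label is not None:
            # A reviewer (human or LLM) has already judged this path; reuse the verdict.
            return node.label
        # Assumed rule: an unreviewed path is flagged only the first time it appears.
        return node.count == 1

    def feedback(self, message: str, is_anomaly: bool) -> None:
        """Store a reviewer's verdict for the path of this message."""
        self._walk(message.split()).label = is_anomaly


if __name__ == "__main__":
    detector = TrieDetector()
    for line in ["connection opened from host-a",
                 "connection opened from host-a",
                 "disk failure on node-3"]:
        print(line, "->", detector.observe(line))
    # A reviewer confirms the disk failure is a real anomaly.
    detector.feedback("disk failure on node-3", True)
    print("disk failure on node-3", "->", detector.observe("disk failure on node-3"))
```

In this sketch, memory grows only with the number of distinct token paths, and each lookup costs time proportional to the message length, which is the general reason a trie can be attractive for streaming logs. A practical system would also need to abstract variable fields such as IDs and timestamps before insertion, which this example omits.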