The increasing complexity of computer systems calls for innovative approaches to fault and error management that go beyond traditional manual log analysis. While existing solutions based on large language models (LLMs) show promise, they are limited by the gap between natural language and domain-specific log language, which restricts their effectiveness in real-world applications. Our approach addresses this limitation by integrating interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), improving performance on log analysis tasks while retaining general natural language processing capabilities. To facilitate this integration, we construct NLPLog, a comprehensive dataset with over 250,000 question-answer pairs. SuperLog, the model trained on this dataset, achieves the best performance across four log analysis tasks, surpassing the second-best model by an average of 12.01%. Our contributions include a novel CPT paradigm that significantly improves model performance, the development of SuperLog with state-of-the-art results, and the release of a large-scale dataset to support further research in this domain.