As dependence on computer systems expands across personal, industrial, and large-scale applications, there is a growing need to improve their reliability in order to sustain business operations and maintain user satisfaction. The system logs these devices generate are a valuable record of historical trends and past failures. Machine learning techniques are now commonly used for failure prediction, extracting insights from past data to anticipate future behavior. Recently, large language models have demonstrated remarkable capabilities in tasks including summarization, reasoning, and event prediction. In this paper, we therefore investigate the potential of large language models for predicting system failures, using insights learned from past failure behavior to inform reasoning and decision-making. Our approach uses system crash logs from the Intel Computing Improvement Program (ICIP) to identify significant events and develop CrashEventLLM, a model built on a large language model framework that serves as our foundation for crash event prediction. Specifically, the model uses historical data, informed by expert annotations, to forecast future crash events. Beyond prediction, it also offers insights into the potential causes of each crash event. This work provides preliminary insights into prompt-based large language models for the log-based event prediction task.