In today's large-scale systems, a single anomaly can affect millions of users. Detecting such events in real time is essential to maintaining quality of service, because it allows the monitoring team to prevent or reduce the impact of a failure. Logs are a core part of software development and maintenance because they record detailed information at runtime. Log data are available in nearly all computer systems and enable developers and system maintainers to monitor and analyze anomalous events. For cloud computing companies and large online platforms in general, growth depends on the ability to scale. Automating anomaly detection is a promising way to keep monitoring capacity in step with the increasing volume of logs generated by modern systems. In this paper, we introduce MoniLog, a distributed approach for detecting anomalies in real time within large-scale environments. It aims to detect sequential and quantitative anomalies within a multi-source log stream. MoniLog is designed to structure a log stream and monitor it for anomalous sequences. Its output classifier learns from the administrator's actions to label anomalies and evaluate their criticality level.