Information Technology has become a critical component of many industries, which has increased the focus on software maintenance and monitoring. Given the complexity of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not suit certain companies because of high costs, data governance issues, and limited coverage of private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind the choices we made in building its components. In particular, we describe our approach and criteria for selecting a data management system, and we explain how it is integrated. Our experience can benefit companies seeking to manage their software maintenance processes internally with a modern AIOps approach.