Root cause analysis for enterprise database incidents is often a manual, time-consuming process that requires operators to inspect logs, performance metrics, and workload behavior. Existing approaches commonly rely on a single source of evidence, which limits their ability to capture the broader operational context behind incidents such as CPU saturation, I/O bottlenecks, lock contention, deadlocks, and slow query execution. This paper presents a multimodal machine learning framework for workload-aware root cause analysis in enterprise database environments. The proposed approach combines workload characteristics, system telemetry, and operational signals drawn from compute-, storage-, and accelerator-oriented datasets. Engineered workload-aware features are used to classify workload behavior and to support downstream diagnosis of likely incident causes. Within the framework, Random Forest, LightGBM, and feedforward neural network models are evaluated for workload classification and root cause analysis support. Experimental results show that workload-aware feature engineering improves workload separability, with LightGBM providing the strongest balance of predictive performance and interpretability. The results suggest that combining multimodal telemetry with workload context can provide a practical foundation for automated and explainable root cause analysis systems.
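As a rough illustration of what workload-aware feature engineering might look like in practice, the sketch below normalizes raw telemetry counters by workload volume and by total busy time. The field names (`cpu_ms`, `io_wait_ms`, `lock_wait_ms`, `queries`), the `TelemetryWindow` type, and the three derived features are hypothetical and chosen only for illustration; they are not the paper's datasets or feature set.

```python
from dataclasses import dataclass


@dataclass
class TelemetryWindow:
    """One hypothetical observation window of raw database telemetry."""
    cpu_ms: float        # CPU time consumed in the window
    io_wait_ms: float    # time spent waiting on storage
    lock_wait_ms: float  # time spent waiting on locks
    queries: int         # number of queries completed in the window


def workload_features(w: TelemetryWindow) -> dict[str, float]:
    """Derive workload-normalized features from raw counters (illustrative only)."""
    busy_ms = w.cpu_ms + w.io_wait_ms + w.lock_wait_ms
    queries = max(w.queries, 1)               # guard against idle windows
    total = busy_ms if busy_ms > 0 else 1.0   # guard against division by zero
    return {
        "cpu_ms_per_query": w.cpu_ms / queries,
        "io_wait_share": w.io_wait_ms / total,
        "lock_wait_share": w.lock_wait_ms / total,
    }


# Made-up numbers, used only to exercise the function.
window = TelemetryWindow(cpu_ms=800.0, io_wait_ms=150.0, lock_wait_ms=50.0, queries=40)
features = workload_features(window)
```

Ratio features of this kind make a CPU-bound window and an I/O-bound window comparable regardless of absolute load, which is one plausible reading of "workload separability". A feature dictionary like this could then be passed to any of the classifiers named above.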