Modern database management systems (DBMSs) struggle to maintain performance and availability under dynamic workloads. This paper proposes a self-healing framework that integrates Model-Agnostic Meta-Learning (MAML) for few-shot anomaly detection, Graph Neural Networks (GNNs) for dependency-driven prediction of cascading failures, and multi-objective Reinforcement Learning (RL) for autonomous recovery. Unlike existing database tuning systems, which focus primarily on offline configuration optimization, our framework enables real-time, end-to-end self-healing by adapting rapidly to unseen workload patterns with minimal labeled data. We introduce dynamic GNN-based dependency modeling that captures workload-dependent relationships between database components, enabling proactive cascade prevention. A scalarized multi-objective RL formulation balances latency, resource utilization, and cost during recovery, while SHAP-based explanations make the framework's decisions transparent to operators. Evaluations on Google Cluster Data and TPC benchmarks show an anomaly-detection F1-score of 90.5\% with 5-shot adaptation, a cascade-prediction accuracy of 90.1\%, and an 85.1\% latency reduction in recovery actions, outperforming strong baselines including Isolation Forest, LSTM autoencoders, static GCN, and standard RL methods.
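As a minimal illustration of the scalarized formulation, assume a weighted-sum scalarization; the weights $w_\ell$, $w_u$, $w_c$ and the normalized penalty terms $\tilde{L}_t$, $\tilde{U}_t$, $\tilde{C}_t$ are illustrative notation, not quantities specified above. Under that assumption, the recovery reward at step $t$ could take the form
\begin{equation*}
r_t = -\left( w_\ell \, \tilde{L}_t + w_u \, \tilde{U}_t + w_c \, \tilde{C}_t \right),
\qquad w_\ell, w_u, w_c \ge 0, \quad w_\ell + w_u + w_c = 1,
\end{equation*}
where $\tilde{L}_t$, $\tilde{U}_t$, and $\tilde{C}_t$ denote penalties for latency, resource utilization, and cost, each scaled to $[0, 1]$ so that the weights express relative priority rather than compensating for differences in units.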