Modern database management systems (DBMS) face significant challenges in maintaining performance and availability under dynamic workloads. This paper proposes a self-healing framework that integrates Model-Agnostic Meta-Learning (MAML) for few-shot anomaly detection, Graph Neural Networks (GNNs) for dependency-driven prediction of cascading failures, and multi-objective Reinforcement Learning (RL) for autonomous recovery. Unlike existing database tuning systems, which focus primarily on offline configuration optimization, our framework performs real-time, end-to-end self-healing and adapts rapidly to unseen workload patterns with minimal labeled data. We introduce dynamic GNN-based dependency modeling that captures workload-dependent relationships between database components and enables proactive cascade prevention. A scalarized multi-objective RL formulation balances latency, resource utilization, and cost during recovery, while SHAP-based explanations make the system's decisions transparent to operators. Evaluations on Google Cluster Data and TPC benchmarks show an anomaly detection F1-score of 90.5\% with 5-shot adaptation, a cascade prediction accuracy of 90.1\%, and an 85.1\% latency reduction from recovery actions, outperforming strong baselines including Isolation Forest, LSTM autoencoders, static GCN, and standard RL methods.
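For illustration, a scalarized multi-objective reward is commonly written as a weighted sum of normalized objectives. The form below is a minimal sketch under that assumption: the weights $w_L$, $w_U$, $w_C$ and the normalized per-step latency $\hat{L}_t$, resource utilization $\hat{U}_t$, and cost $\hat{C}_t$ are hypothetical symbols introduced here, since the abstract does not state the weights or the normalization.
\begin{equation}
r_t = -\left( w_L \hat{L}_t + w_U \hat{U}_t + w_C \hat{C}_t \right), \qquad w_L + w_U + w_C = 1, \quad w_L, w_U, w_C \ge 0 .
\end{equation}
Under this sketch, a larger $w_L$ favors recovery actions that reduce latency even when they consume more resources or cost more, and the trade-off can be shifted by adjusting the weights without changing the RL algorithm.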