Machine learning models often perform worse when they are used to predict outcomes on data that differs from the data they were trained on. Such scenarios often arise in the real world when the data distribution changes, either gradually or abruptly, for example because of a major event such as a pandemic. Machine learning research has produced many techniques intended to be resilient to such concept drift. However, there is no principled framework for identifying the drivers behind the drift in model performance. In this paper, we propose DBShap, a novel framework that uses Shapley values to identify the main contributors to the drift and quantify their respective contributions. The proposed framework not only quantifies the importance of individual features in driving the drift but also includes the change in the underlying relation between the input and the output as a possible driver. The explanation provided by DBShap can be used to understand the root cause of the drift and to make the model more resilient to it.
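To make the idea of attributing a performance drift to several drivers concrete, the following is a minimal sketch of exact Shapley attribution over three hypothetical "players": a shift in the distribution of each of two features and a change in the input-output relation. It is not the DBShap algorithm itself, whose details are not given in the abstract. The toy data-generating process, the fixed model `1 + x2`, and the names `shapley_values`, `value_fn`, `x1_dist`, `x2_dist`, and `concept` are all assumptions made for illustration.

```python
import itertools
import math

import numpy as np


def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating every coalition (fine for few players)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Standard Shapley weight for a coalition of size k.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[p] += weight * (value_fn(s | {p}) - value_fn(s))
    return phi


# Hypothetical toy setup: common random numbers keep every scenario comparable.
rng = np.random.default_rng(0)
z1 = rng.standard_normal(20_000)
z2 = rng.standard_normal(20_000)

players = ["x1_dist", "x2_dist", "concept"]


def value_fn(shifted):
    """Mean squared error of a fixed model when the drivers in `shifted` take
    their target (post-drift) state and all others keep their source state."""
    x1 = z1 + 2.0 if "x1_dist" in shifted else z1      # assumed mean shift in x1
    x2 = 2.0 * z2 if "x2_dist" in shifted else z2      # assumed scale shift in x2
    slope = 2.0 if "concept" in shifted else 1.0       # assumed change in P(y | x)
    y = x1 ** 2 + slope * x2
    pred = 1.0 + x2                                    # assumed model fit on source data
    return float(np.mean((y - pred) ** 2))


phi = shapley_values(players, value_fn)
total_drift = value_fn(frozenset(players)) - value_fn(frozenset())

for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
print(f"total drift in MSE: {total_drift:.3f}")
```

By the efficiency property of Shapley values, the three contributions sum to the total change in error between the source and target scenarios. Exact enumeration costs on the order of 2^n evaluations of `value_fn`, so a setting with many features would need a sampling-based approximation instead.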