Interpreting data mining models has received much attention recently, because people may distrust a black-box predictive model if they do not understand how it works. A model is therefore more trustworthy if it can transparently explain how it reaches a decision. Although many rule-based interpretable classification algorithms have been proposed, none of these existing solutions can directly construct an interpretable model that provides a personalized prediction for each individual test sample. In this paper, we take a first step towards formally introducing personalized interpretable classification as a new data mining problem. In addition to formulating this new problem, we present a greedy algorithm called PIC (Personalized Interpretable Classifier) that identifies a personalized rule for each individual test sample. To improve running efficiency, we also present a fast approximate algorithm called fPIC. To demonstrate the necessity, feasibility and advantages of such a personalized interpretable classification method, we conduct a series of empirical studies on real data sets. The experimental results show that: (1) the new problem formulation enables us to find interesting rules for test samples that may be missed by existing non-personalized classifiers; (2) our algorithms achieve predictive accuracy comparable to that of state-of-the-art (SOTA) interpretable classifiers; (3) on a real data set for predicting breast cancer metastasis, such personalized interpretable classifiers can outperform SOTA methods in terms of both accuracy and interpretability.
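To make the idea of a personalized rule concrete, the following is a minimal illustrative sketch and not the PIC or fPIC algorithm itself, whose details are not given in this abstract. It assumes categorical features stored as dictionaries and greedily adds conditions of the form "feature equals the test sample's value" as long as the fraction of covered training samples sharing the majority label increases. The function names, the `min_support` parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return (most common label, its fraction) for a non-empty list."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def greedy_personal_rule(train_X, train_y, sample, min_support=2):
    """Greedily build one rule tailored to `sample` (illustrative only).

    Every condition in the rule has the form feature == sample[feature],
    so the rule always covers the test sample it was built for.
    """
    covered = list(range(len(train_X)))  # training rows matching the rule
    label, conf = majority([train_y[i] for i in covered])
    rule = {}
    while True:
        best = None
        for feat, value in sample.items():
            if feat in rule:
                continue
            rows = [i for i in covered if train_X[i].get(feat) == value]
            if len(rows) < min_support:
                continue  # too few supporting rows to trust the condition
            cand_label, cand_conf = majority([train_y[i] for i in rows])
            key = (cand_conf, len(rows))  # prefer confidence, then support
            if cand_conf > conf and (best is None or key > best[0]):
                best = (key, feat, rows, cand_label)
        if best is None:
            break  # no remaining condition improves confidence
        (conf, _), feat, covered, label = best
        rule[feat] = sample[feat]
    return rule, label, conf


# Toy data (hypothetical), only to show that rules differ per sample.
train_X = [
    {"age": "young", "tumor": "large"},
    {"age": "young", "tumor": "small"},
    {"age": "old", "tumor": "large"},
    {"age": "old", "tumor": "large"},
    {"age": "old", "tumor": "small"},
    {"age": "young", "tumor": "small"},
    {"age": "old", "tumor": "large"},
    {"age": "young", "tumor": "large"},
]
train_y = ["neg", "neg", "pos", "pos", "neg", "neg", "pos", "neg"]

result_a = greedy_personal_rule(train_X, train_y, {"age": "old", "tumor": "large"})
result_b = greedy_personal_rule(train_X, train_y, {"age": "young", "tumor": "small"})
print(result_a)
print(result_b)
```

In this toy setting the two test samples receive rules of different lengths, which is the sense in which the prediction and its explanation are personalized: each rule is built from the attribute values of the sample being classified rather than drawn from a single global rule set.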