Visualizing Machine Learning (ML) models is a key part of the ML process for increasing their interpretability and prediction accuracy. Decision Trees (DTs) are essential in ML because they are used to understand many black-box ML models, including Deep Learning models. This research proposes two new methods for creating and enhancing Decision Trees as understandable models with complete visualization. These methods use two versions of General Line Coordinates (GLC): Bended Coordinates (BC) and Shifted Paired Coordinates (SPC). Bended Coordinates are a set of line coordinates in which each coordinate is bent at the threshold point of the respective DT node. In SPC, each n-D point is visualized as a directed graph in a set of shifted pairs of 2-D Cartesian coordinates. These new methods expand and complement the capabilities of existing methods to visualize DT models more completely. These capabilities allow us to observe and analyze: (1) relations between attributes, (2) individual cases relative to the DT structure, (3) data flow in the DT, (4) sensitivity of each split threshold in the DT nodes, and (5) density of cases in parts of the n-D space. These features are critical for the evaluation and improvement of DT model performance by domain experts and end users because they help prevent overgeneralization and overfitting. The advantages of this methodology are illustrated in case studies on real-world benchmark datasets. The paper also demonstrates how to generalize these methods to decision tree visualization in other General Line Coordinates.
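The SPC mapping described above can be illustrated with a minimal Python sketch. It assumes an even number of attributes, consecutive pairing of attributes (x1, x2), (x3, x4), ..., and user-supplied origin shifts for each 2-D coordinate system. The function names `spc_nodes` and `spc_edges` and the shift values are hypothetical choices for illustration, not the paper's implementation.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def spc_nodes(x: Sequence[float], shifts: Sequence[Point2D]) -> List[Point2D]:
    """Map an n-D point to graph nodes in Shifted Paired Coordinates.

    Assumption: attributes are paired consecutively, and pair k is drawn in
    its own 2-D Cartesian system whose origin is moved by shifts[k].
    """
    if len(x) % 2 != 0:
        raise ValueError("this sketch assumes an even number of attributes")
    n_pairs = len(x) // 2
    if len(shifts) != n_pairs:
        raise ValueError("one shift is required per attribute pair")
    nodes = []
    for k in range(n_pairs):
        dx, dy = shifts[k]
        # Pair (x[2k], x[2k+1]) becomes one 2-D node in the shifted system.
        nodes.append((x[2 * k] + dx, x[2 * k + 1] + dy))
    return nodes


def spc_edges(nodes: Sequence[Point2D]) -> List[Tuple[Point2D, Point2D]]:
    """Connect consecutive nodes with directed edges (source, target)."""
    return list(zip(nodes[:-1], nodes[1:]))


# Hypothetical 6-D case drawn in three coordinate systems shifted along the x-axis.
case = [1, 2, 3, 4, 5, 6]
shifts = [(0, 0), (10, 0), (20, 0)]
nodes = spc_nodes(case, shifts)
edges = spc_edges(nodes)
```

Each 6-D case becomes a directed path with three nodes and two edges, which a plotting library could then draw as arrows over the shifted coordinate axes.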