Building accurate and interpretable Machine Learning (ML) models for heterogeneous (mixed) data is a long-standing challenge for ML algorithms that are designed for numeric data. This work focuses on three tasks: (1) developing numeric coding schemes for non-numeric attributes that support accurate and explainable ML models, (2) developing methods for lossless visualization of n-D non-numeric categorical data with visual rule discovery in these visualizations, and (3) building accurate and explainable ML models for categorical data. The study proposes a classification of mixed data types and analyzes their role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data, together with visual exploration of that data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation on categorical data is proposed and successfully evaluated in multiple computational experiments. This work is a step toward full-scope ML algorithms for mixed data supported by lossless visualization of n-D data in General Line Coordinates, which go beyond Parallel Coordinates.