Building accurate and interpretable Machine Learning (ML) models for heterogeneous (mixed) data is a long-standing challenge for algorithms designed for numeric data. This work focuses on three tasks: developing numeric coding schemes for non-numeric attributes that support accurate and explainable ML models; developing methods for lossless visualization of n-D non-numeric categorical data, with visual rule discovery in these visualizations; and building accurate and explainable ML models for categorical data. The study proposes a classification of mixed data types and analyzes their role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data, together with visual exploration of mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments. This work is one step toward full-scope ML algorithms for mixed data, supported by lossless visualization of n-D data in General Line Coordinates beyond Parallel Coordinates.