Building accurate and interpretable Machine Learning (ML) models for heterogeneous (mixed) data is a long-standing challenge for ML algorithms designed for numeric data. This work focuses on three tasks: developing numeric coding schemes for non-numeric attributes that support accurate and explainable ML models; developing methods for lossless visualization of n-D non-numeric categorical data, with visual rule discovery in these visualizations; and building accurate and explainable ML models for categorical data. The study proposes a classification of mixed data types and analyzes their important role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data, together with visual exploration of mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments. This work is one step toward full-scope ML algorithms for mixed data supported by lossless visualization of n-D data in General Line Coordinates, beyond Parallel Coordinates.