Clustering, classification, and representation are three fundamental objectives of learning from high-dimensional data with intrinsic structure. To this end, this paper introduces three interpretable approaches: segmentation (clustering) via the Minimum Lossy Coding Length criterion, classification via the Minimum Incremental Coding Length criterion, and representation via the Maximal Coding Rate Reduction criterion. All three are derived from the lossy data coding and compression framework, which rests on rate-distortion theory in information theory. The resulting algorithms are particularly well suited to finite-sample data (possibly sparse or nearly degenerate) drawn from mixtures of Gaussian distributions or subspaces. The theoretical value and attractive features of these methods are summarized through comparison with other learning methods and evaluation criteria. This summary note aims to provide a theoretical guide for researchers and engineers interested in understanding 'white-box' (deep) machine learning methods.
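To make the coding rate reduction idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the commonly used form of the lossy coding rate for d-dimensional data with m samples at distortion ε, namely R(Z, ε) = ½ log det(I + d/(m ε²) Z Zᵀ), and measures the rate of the whole data set minus the sample-weighted rates of its parts. The function names `coding_rate` and `rate_reduction`, the default `eps`, and the toy data are illustrative assumptions.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Lossy coding rate of the columns of Z (shape d x m), in nats.

    Assumed form: R(Z, eps) = 1/2 * logdet(I + d / (m * eps^2) * Z Z^T).
    """
    d, m = Z.shape
    gram = np.eye(d) + (d / (m * eps ** 2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)); gram is positive definite.
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the sample-weighted rates of each group."""
    _, m = Z.shape
    labels = np.asarray(labels)
    rate_whole = coding_rate(Z, eps)
    rate_parts = 0.0
    for k in np.unique(labels):
        Zk = Z[:, labels == k]
        rate_parts += (Zk.shape[1] / m) * coding_rate(Zk, eps)
    return rate_whole - rate_parts


# Toy data (hypothetical): 20 samples on the first axis and 20 on the second axis of R^3.
rng = np.random.default_rng(0)
Z_demo = np.zeros((3, 40))
Z_demo[0, :20] = rng.normal(size=20)
Z_demo[1, 20:] = rng.normal(size=20)
labels_true = np.repeat([0, 1], 20)    # groups aligned with the two subspaces
labels_mixed = np.arange(40) % 2       # groups that mix the two subspaces

print("aligned grouping:", rate_reduction(Z_demo, labels_true))
print("mixed grouping:  ", rate_reduction(Z_demo, labels_mixed))
```

Under these assumptions, a grouping aligned with the underlying subspaces should yield a larger rate reduction than one that mixes them, which is the intuition behind using the quantity as a criterion for representation.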