High-dimension, low-sample-size (HDLSS) problems are numerous among real-world applications of machine learning. From medical imaging to text processing, traditional machine learning algorithms usually fail to learn the best possible concept from such data. In previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that achieves state-of-the-art results on such problems. In this work, we transpose the core principle of this approach to HDLSS classification problems by using the RF similarity measure as a learned, precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly well suited to, and accurate in, this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods on the majority of HDLSS problems while remaining very competitive on low- or non-HDLSS problems.
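To make the idea of a learned, precomputed kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes the RF similarity between two samples is the classical random forest proximity, that is, the fraction of trees in which both samples fall into the same leaf; the abstract does not define the measure, so this choice is an assumption. The function names `rf_similarity`, `fit_rfsvm`, and `predict_rfsvm`, the hyperparameter values, and the synthetic data are all hypothetical and chosen for illustration only; the scikit-learn calls (`RandomForestClassifier.apply`, `SVC(kernel="precomputed")`) are standard.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def rf_similarity(forest, X_a, X_b):
    """Assumed RF similarity: fraction of trees in which a pair shares a leaf."""
    leaves_a = forest.apply(X_a)  # leaf index per tree, shape (n_a, n_trees)
    leaves_b = forest.apply(X_b)  # leaf index per tree, shape (n_b, n_trees)
    # Compare leaf indices for every pair of samples, then average over trees.
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return same_leaf.mean(axis=2)  # shape (n_a, n_b), values in [0, 1]


def fit_rfsvm(X_train, y_train, n_trees=100, C=1.0, seed=0):
    """Fit a forest, then an SVM on the forest-induced train/train kernel."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_train, y_train)
    K_train = rf_similarity(forest, X_train, X_train)
    svm = SVC(kernel="precomputed", C=C)
    svm.fit(K_train, y_train)
    return forest, svm


def predict_rfsvm(forest, svm, X_train, X_test):
    """Predict using the test/train kernel, as a precomputed SVC requires."""
    K_test = rf_similarity(forest, X_test, X_train)
    return svm.predict(K_test)


if __name__ == "__main__":
    # Synthetic HDLSS-like data (illustrative only): 60 samples, 500 features.
    X, y = make_classification(n_samples=60, n_features=500,
                               n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0, stratify=y)
    forest, svm = fit_rfsvm(X_tr, y_tr)
    accuracy = (predict_rfsvm(forest, svm, X_tr, X_te) == y_te).mean()
    print(f"RFSVM sketch accuracy on held-out data: {accuracy:.2f}")
```

Note that the pairwise comparison materializes an array of size n_a × n_b × n_trees, which is acceptable in the low-sample-size setting but would need batching for larger datasets.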