High dimension, low sample size (HDLSS) problems are common in real-world applications of machine learning. From medical imaging to text processing, traditional machine learning algorithms usually fail to learn the best possible concept from such data. In previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), which achieves state-of-the-art results on such problems. In this work, we carry the core principle of this approach over to HDLSS classification by using the RF similarity measure as a learned, precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly well suited to this classification context and yields accurate classifiers. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods on the majority of HDLSS problems while remaining very competitive on low-HDLSS and non-HDLSS problems.
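The idea of using an RF similarity as a precomputed SVM kernel can be illustrated with a minimal sketch. It assumes the classic shared-leaf proximity (the fraction of trees in which two samples fall into the same leaf) as the similarity measure, which may differ from the exact measure used in this work, and it relies on scikit-learn. The helper name `rf_similarity`, the synthetic data, and all hyperparameter values are illustrative assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def rf_similarity(leaves_a, leaves_b):
    """Fraction of trees in which each pair of samples shares a leaf.

    leaves_a: (n_a, n_trees) leaf indices returned by forest.apply()
    leaves_b: (n_b, n_trees) leaf indices from the same forest
    Returns an (n_a, n_b) matrix with values in [0, 1].
    (Hypothetical helper; assumes the shared-leaf proximity.)
    """
    n_trees = leaves_a.shape[1]
    sim = np.zeros((leaves_a.shape[0], leaves_b.shape[0]))
    for t in range(n_trees):
        # Compare leaf indices of tree t for every pair of samples.
        sim += leaves_a[:, t][:, None] == leaves_b[:, t][None, :]
    return sim / n_trees


# Synthetic HDLSS-like data (illustrative only): 60 samples, 500 features.
X, y = make_classification(
    n_samples=60, n_features=500, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 1. Learn the similarity: fit a forest on the training set only.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
leaves_train = forest.apply(X_train)  # shape (n_train, n_trees)
leaves_test = forest.apply(X_test)    # shape (n_test, n_trees)

# 2. Build the kernel matrices from shared-leaf counts.
K_train = rf_similarity(leaves_train, leaves_train)  # (n_train, n_train)
K_test = rf_similarity(leaves_test, leaves_train)    # (n_test, n_train)

# 3. Train an SVM on the precomputed kernel and predict.
svm = SVC(kernel="precomputed", C=1.0)
svm.fit(K_train, y_train)
y_pred = svm.predict(K_test)
print("test accuracy:", (y_pred == y_test).mean())
```

Note that the test kernel is computed between test samples and training samples, in that order, since a precomputed-kernel SVM expects one row per sample to classify and one column per training sample. The accuracy printed here only checks that the pipeline runs on toy data; it says nothing about the results reported above.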