Classification is a classic problem, but it faces many challenges when the number of features is large, as is common in modern applications such as identifying tumor subtypes from genomic data or categorizing customer attitudes from online reviews. We propose a new framework that uses the ranks of pairwise distances among observations and identifies a common pattern in moderate to high dimensions that has previously been overlooked. The proposed method exhibits superior classification power over existing methods under a variety of scenarios. Furthermore, it can be applied to non-Euclidean data objects, such as network data. We illustrate the method through an analysis of Neuropixels data, in which neurons are classified based on their firing activities. Additionally, we explore a related approach that is simpler to understand, and we investigate key quantities that play essential roles in our novel approach.