Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and their use on discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.