This article illustrates intRinsic, an R package that implements state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. To make these estimators easily accessible, the package exposes a small number of high-level functions that rely on a broader set of efficient, low-level routines. The models in intRinsic fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. The second category contains the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets, which lets us assess the validity of the results immediately and eases the exposition. We then use the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how estimating homogeneous and heterogeneous intrinsic dimensions yields valuable insights into the topological structure of a dataset.
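To make the idea behind the two nearest neighbors estimator concrete, the following is a minimal Python sketch and not the package's own code or API. It assumes the standard result from the TWO-NN literature, which the abstract does not state: for each point, the ratio mu = r2 / r1 of the distances to its second and first nearest neighbors approximately follows a Pareto(1, d) distribution, so the maximum-likelihood estimate of the intrinsic dimension d is n / sum(log mu_i). The function name `twonn_mle` and the simulated data are hypothetical, chosen only for illustration.

```python
import numpy as np


def twonn_mle(X):
    """Maximum-likelihood TWO-NN estimate of the intrinsic dimension.

    Assumes mu_i = r2_i / r1_i ~ Pareto(1, d), which gives d_hat = n / sum(log mu_i).
    Illustrative sketch only; it does not reproduce the intRinsic implementation.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # a point is not its own neighbor

    # Two smallest squared distances in each row: first and second neighbor.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])

    mu = r2 / r1                     # ratios are >= 1 by construction
    return float(n / np.sum(np.log(mu)))


# Simulated check: a 3-dimensional Gaussian cloud embedded isometrically in R^10.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 3))                   # latent 3-d coordinates
Q, _ = np.linalg.qr(rng.normal(size=(10, 3)))    # orthonormal 10 x 3 embedding
X = Z @ Q.T                                      # observed data in 10 dimensions

print("estimated intrinsic dimension:", twonn_mle(X))
```

The estimate should land near the latent dimension (3) rather than the ambient one (10), since only the two nearest-neighbor distances of each point enter the estimator. The sketch assumes no duplicated points, because a zero first-neighbor distance would make the ratio undefined.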