Dimensionality reduction is a fundamental task in modern data science. Several projection methods have been proposed that account for the non-linearity of the data via local embeddings. Such methods are often based on local neighbourhood structures and require tuning two hyper-parameters: the number of neighbours that defines this local structure, and the dimensionality of the lower-dimensional space onto which the data are projected. These choices critically influence the quality of the resulting embedding. In this paper, we exploit a recently proposed intrinsic dimension estimator that also returns the optimal locally adaptive neighbourhood sizes according to some desirable criteria. In principle, this adaptive framework can be used for optimal hyper-parameter tuning of any dimensionality reduction algorithm that relies on local neighbourhood structures. Numerical experiments on both real-world and simulated datasets show that the proposed method can significantly improve well-known projection methods across various learning tasks, with improvements measurable through both quantitative metrics and the quality of low-dimensional visualizations.