Modern machine learning systems are increasingly trained on large amounts of data embedded in high-dimensional spaces, often without any analysis of the structure of the dataset. In this work, we propose a framework to study the geometric structure of the data. We use our recently introduced non-negative kernel (NNK) regression graphs to estimate the point density, intrinsic dimension, and linearity (curvature) of the data manifold. We further generalize the graph construction and geometric estimation to multiple scales by iteratively merging neighborhoods in the input data. Our experiments demonstrate that the proposed approach estimates the local geometry of data manifolds more effectively than other baselines on both synthetic and real datasets.
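To make the notion of an NNK neighborhood concrete, the following is a minimal sketch of how the neighbors of a single point could be selected, not the authors' implementation. It assumes the usual formulation of NNK as a non-negative least-squares problem in kernel space, restricted to an initial set of k nearest neighbors. The Gaussian kernel, the bandwidth `sigma`, the projected-gradient solver, the threshold `tol`, and the function names `gaussian_kernel` and `nnk_neighbors` are all illustrative assumptions that do not appear in the abstract.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def nnk_neighbors(X, i, k=10, sigma=1.0, n_iter=2000, tol=1e-6):
    """Return (indices, weights) of the NNK neighbors of point i.

    Assumed formulation: start from the k nearest neighbors S of x_i and solve
        min_{theta >= 0}  0.5 * theta^T K_SS theta - K_Si^T theta
    Neighbors whose weight ends up at zero are discarded.
    """
    X = np.asarray(X, dtype=float)

    # Initial candidate set: k nearest neighbors, excluding the point itself.
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf
    S = np.argsort(dist)[:k]

    K_SS = gaussian_kernel(X[S], X[S], sigma)
    K_Si = gaussian_kernel(X[S], X[i:i + 1], sigma).ravel()

    # Projected gradient descent (solver choice is an assumption of this sketch).
    # Step size is 1 / largest eigenvalue of K_SS, which guarantees convergence.
    step = 1.0 / np.linalg.eigvalsh(K_SS)[-1]
    theta = np.zeros(len(S))
    for _ in range(n_iter):
        grad = K_SS @ theta - K_Si
        theta = np.maximum(theta - step * grad, 0.0)

    keep = theta > tol
    return S[keep], theta[keep]


if __name__ == "__main__":
    # Points on a straight line embedded in 2-D; query the point at the origin.
    X = np.array([[0, 0], [1, 0], [-1, 0], [2, 0], [-2, 0], [3, 0]], dtype=float)
    idx, w = nnk_neighbors(X, i=0, k=4)
    print(sorted(idx.tolist()), w)
```

For points sampled along a line, the sketch keeps only the closest neighbor on each side of the query and drops the farther ones, even though they were among the initial k nearest neighbors. Local statistics of such neighborhoods (for instance, how many neighbors are retained) are the kind of quantity a geometric estimator might build on; the specific estimators used in the work are not described in this abstract.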