The notion of local intrinsic dimensionality (LID) is an important advancement in data dimensionality analysis, with applications in data mining, machine learning and similarity search problems. Existing distance-based LID estimators were designed for tabular datasets whose data points are represented as vectors in a Euclidean space. After discussing their limitations for graph-structured data, with respect to both graph embeddings and graph distances, we propose NC-LID, a novel LID-related measure that quantifies the discriminatory power of the shortest-path distance with respect to the natural communities of nodes, taken as their intrinsic localities. We show how this measure can be used to design LID-aware graph embedding algorithms by formulating two LID-elastic variants of node2vec whose personalized hyperparameters are adjusted according to NC-LID values. Our empirical analysis of NC-LID on a large number of real-world graphs shows that this measure identifies nodes with high link reconstruction errors in node2vec embeddings better than node centrality metrics do. The experimental evaluation also shows that the proposed LID-elastic node2vec extensions improve on node2vec by better preserving graph structure in the generated embeddings.
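The abstract does not state the formula for NC-LID, so the following Python sketch is only an illustration of what "discriminatory power of the shortest-path distance with respect to a community" could mean, not the authors' definition. It assumes that the community of a node is already given (natural community detection is not shown) and compares the community size with the number of nodes inside the smallest shortest-path ball around the node that covers the community. The function names `bfs_distances` and `nc_lid_sketch` and the log-ratio scoring are assumptions made for this example.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nc_lid_sketch(adj, node, community):
    """Illustrative score (an assumption, not the paper's formula).

    Returns ln(|ball| / |community|), where ball is the set of nodes whose
    shortest-path distance from `node` does not exceed the distance to the
    farthest community member. The score is 0 when the ball contains only
    community members and grows as more outsiders fall inside the ball.
    """
    dist = bfs_distances(adj, node)
    members = set(community) | {node}
    missing = [m for m in members if m not in dist]
    if missing:
        raise ValueError(f"community members unreachable from {node}: {missing}")
    radius = max(dist[m] for m in members)
    ball_size = sum(1 for d in dist.values() if d <= radius)
    return math.log(ball_size / len(members))


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

inner_score = nc_lid_sketch(adj, 0, {0, 1, 2})   # ball of radius 1 is exactly the community
bridge_score = nc_lid_sketch(adj, 2, {0, 1, 2})  # ball of radius 1 also contains node 3
```

In this toy graph the node on the bridge between the two triangles receives a higher score than an interior node, which matches the intuition that the shortest-path distance separates its community from the rest of the graph less cleanly.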