Feature extraction is an essential task in graph analytics. The extracted feature vectors, called graph descriptors, serve as input to downstream vector-space-based graph analysis models. This approach has proved fruitful, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms for computing meaningful descriptors do not scale to large graphs for two reasons: (1) they require storing the entire graph in memory, and (2) the end user has no control over the algorithm's runtime. In this paper, we present streaming algorithms that approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams avoids storing the entire graph in memory, and controlling the sample size keeps the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing their approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to state-of-the-art methods while using only 25% as much memory.
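To make the edge-stream setting concrete, the following is a minimal sketch of keeping a fixed-size uniform sample of edges while reading a stream once. It uses generic reservoir sampling, which is an assumption chosen for illustration; the abstract does not state which sampling scheme the proposed algorithms use. The names `reservoir_sample_edges`, `edge_stream`, and `sample_size` are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[int, int]


def reservoir_sample_edges(
    edge_stream: Iterable[Edge],
    sample_size: int,
    seed: Optional[int] = None,
) -> List[Edge]:
    """Keep a uniform random sample of at most `sample_size` edges.

    Illustrative only: generic reservoir sampling, not necessarily the
    sampling scheme used by the paper's algorithms. Memory use depends on
    `sample_size`, not on the number of edges in the stream.
    """
    rng = random.Random(seed)
    reservoir: List[Edge] = []
    for i, edge in enumerate(edge_stream):
        if i < sample_size:
            # Fill the reservoir with the first `sample_size` edges.
            reservoir.append(edge)
        else:
            # Replace a stored edge with probability sample_size / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < sample_size:
                reservoir[j] = edge
    return reservoir


if __name__ == "__main__":
    # Hypothetical stream: a path graph with one million edges, generated
    # lazily so the full edge list is never held in memory.
    stream = ((u, u + 1) for u in range(1_000_000))
    sample = reservoir_sample_edges(stream, sample_size=1000, seed=42)
    print(len(sample))
```

In this sketch the sample size is the single knob that bounds both memory and the work done by any later estimation step, which mirrors the trade-off the abstract describes.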