Heterogeneous information networks (HINs) represent entities of different types and the relationships between them. Exploring, analysing, and extracting knowledge from such networks relies on metapath queries, which identify pairs of entities connected by relationships of diverse semantics. Although evaluating metapath query workloads in real time on large, web-scale HINs is computationally demanding, current approaches do not exploit the interrelationships among queries. In this paper, we present ATRAPOS, a new approach for the real-time evaluation of metapath query workloads that combines efficient sparse matrix multiplication with intermediate result caching. ATRAPOS selects intermediate results to cache and reuse by detecting frequent sub-metapaths among workload queries in real time, using a tailor-made data structure, the Overlap Tree, and an associated caching policy. Our experimental study on real data shows that ATRAPOS accelerates exploratory data analysis and mining on HINs, outperforming off-the-shelf caching approaches and state-of-the-art research prototypes in all examined scenarios. Note that this is an extended version of the work presented at TheWebConf 2023 (doi: 10.1145/3543507.3583322).
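To make the idea of combining sparse matrix multiplication with intermediate result caching more concrete, the following is a minimal sketch and not the ATRAPOS implementation. It assumes a toy HIN with authors (A), papers (P), and venues (V), stores each relation as a dictionary-based sparse matrix, and uses a plain dictionary keyed by sub-metapath as the cache. The Overlap Tree and its caching policy are not modelled here; the names `spmm`, `evaluate`, `ADJ`, `cache`, and `stats`, as well as all data, are hypothetical.

```python
from collections import defaultdict


def spmm(left, right):
    """Multiply two sparse matrices stored as {row: {col: count}}."""
    out = defaultdict(lambda: defaultdict(int))
    for i, row in left.items():
        for k, a in row.items():
            # Only non-zero entries of the matching row in `right` contribute.
            for j, b in right.get(k, {}).items():
                out[i][j] += a * b
    return {i: dict(cols) for i, cols in out.items()}


# Toy HIN (made-up data): one sparse adjacency matrix per relation.
ADJ = {
    "AP": {"a1": {"p1": 1, "p2": 1}, "a2": {"p2": 1}},
    "PV": {"p1": {"v1": 1}, "p2": {"v1": 1}},
    "VP": {"v1": {"p1": 1, "p2": 1}},
    "PA": {"p1": {"a1": 1}, "p2": {"a1": 1, "a2": 1}},
}

cache = {}                      # sub-metapath -> intermediate result (assumed plain dict)
stats = {"multiplications": 0}  # counts matrix products actually computed


def evaluate(metapath):
    """Evaluate a metapath such as "APV" left to right, reusing cached prefixes."""
    if len(metapath) == 2:
        return ADJ[metapath]
    if metapath in cache:
        return cache[metapath]
    prefix = evaluate(metapath[:-1])
    result = spmm(prefix, ADJ[metapath[-2:]])
    stats["multiplications"] += 1
    cache[metapath] = result
    return result


apv = evaluate("APV")      # authors connected to venues through papers
apvpa = evaluate("APVPA")  # reuses the cached "APV" result as a prefix
```

In this sketch the second query shares the prefix `APV` with the first, so only the two remaining products are computed instead of all three. A real system would also need to decide which intermediate results are worth keeping under a memory budget, which is the role the abstract assigns to the Overlap Tree and its caching policy.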