Graph pattern mining is important for analyzing graph data. Graph mining systems typically require answering pattern matching queries, which involve solving the NP-complete subgraph isomorphism problem. To address this, domain experts often develop custom optimization strategies that exploit substructural similarities across different patterns. While these optimizers can be effective, their development is challenging, which limits the exploration of interactions between different optimization strategies and restricts experts from continuously improving the optimizers, for example by incorporating additional custom or general pattern-based equivalences over time. We present a programmable pattern matching query optimizer called Geo, which automatically manages the interactions between various equivalences, ensures that the optimizations maintain correctness of results, and simplifies the management of substructure equivalences. Geo exposes a simple but flexible language for expressing pattern equivalences as rewrite rules. By maintaining canonical representations of generated patterns during equality saturation, Geo avoids issues arising from syntactic differences in isomorphic patterns. Additionally, we develop embedded reconstructability (EmRec), which tracks provenance across equivalences to meet the various reconstructability needs of desired outputs. Our evaluation demonstrates that Geo can discover novel query equivalences through complex composition of various rewrite rules, enabling our optimized queries to achieve a cost reduction of up to 99% compared to the queries in prior work. We further test Geo's effectiveness at speeding up practical graph mining problems by using it in two representative case studies, approximate pattern matching and quasi-clique mining, and find that it is highly effective at optimizing these tasks, enabling cost reductions of up to 71%.
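To illustrate why canonical representations matter, here is a minimal sketch, not Geo's actual mechanism, of how two syntactically different descriptions of the same pattern can be mapped to one canonical form. It assumes small, undirected, unlabeled patterns given as edge lists over vertices `0..n-1`, and it uses brute-force enumeration of vertex relabelings, which is only practical for patterns with a handful of vertices. The function name `canonical_form` and the example patterns are hypothetical.

```python
from itertools import permutations


def canonical_form(num_vertices, edges):
    """Return the lexicographically smallest sorted edge tuple over all
    vertex relabelings (brute force; illustrative only)."""
    best = None
    for perm in permutations(range(num_vertices)):
        # Relabel each edge, normalise endpoint order, then sort the edges.
        relabeled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabeled < best:
            best = relabeled
    return best


# Two different spellings of the same 3-vertex path, plus a triangle.
path_a = [(0, 1), (1, 2)]   # path centred on vertex 1
path_b = [(2, 0), (0, 1)]   # path centred on vertex 0
triangle = [(0, 1), (1, 2), (0, 2)]

# Isomorphic patterns share a canonical form; non-isomorphic ones do not.
same_pattern = canonical_form(3, path_a) == canonical_form(3, path_b)
different_pattern = canonical_form(3, path_a) != canonical_form(3, triangle)
```

With a canonical form like this, an optimizer can use plain equality to decide whether a newly generated pattern is one it has already seen, instead of running an isomorphism test against every stored pattern.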