Retrieving graphs from a large corpus that contain a subgraph isomorphic to a given query graph is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by the need to score every corpus graph exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting from a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text-document-like representation lets us leverage classic, highly optimized inverted indices while still supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a 'token' on a graph (such as TF-IDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency trade-offs. To our knowledge, CORGII is the first indexer of dense graph representations that uses discrete tokens mapped to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency than several baselines.
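To make the indexing idea concrete, the following is a minimal sketch of how sparse binary codes could feed an inverted index with per-token impact weights. It is an illustration under stated assumptions, not CORGII's implementation: the hard threshold in `tokenize` stands in for the learned differentiable discretization module, the `impact` dictionary stands in for the trained impact weights, and all function names, vectors, and weights are hypothetical.

```python
from collections import defaultdict


def tokenize(node_vectors, threshold=0.5):
    """Return the set of active token ids for a graph.

    Assumption: a hard threshold replaces the learned, differentiable
    discretization described in the abstract.
    """
    tokens = set()
    for vec in node_vectors:
        for token_id, value in enumerate(vec):
            if value > threshold:
                tokens.add(token_id)
    return tokens


def build_index(corpus):
    """Build inverted lists: token id -> list of graph ids containing it."""
    index = defaultdict(list)
    for graph_id, node_vectors in corpus.items():
        for token_id in sorted(tokenize(node_vectors)):
            index[token_id].append(graph_id)
    return index


def retrieve(index, query_vectors, impact, k):
    """Score only graphs that share a token with the query; return top-k ids.

    Assumption: `impact` holds one hand-picked weight per token as a
    placeholder for a data-driven, trained impact weight.
    """
    scores = defaultdict(float)
    for token_id in tokenize(query_vectors):
        for graph_id in index.get(token_id, []):
            scores[graph_id] += impact.get(token_id, 1.0)
    # Sort by descending score, breaking ties by graph id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [graph_id for graph_id, _ in ranked[:k]]


# Toy corpus: each graph is a list of per-node vectors over a 3-token vocabulary.
corpus = {
    "g1": [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]],
    "g2": [[0.9, 0.2, 0.1]],
    "g3": [[0.1, 0.1, 0.9]],
}
index = build_index(corpus)
query = [[0.8, 0.0, 0.7]]
impact = {0: 0.5, 1: 1.0, 2: 2.0}
top = retrieve(index, query, impact, k=2)  # → ['g1', 'g3']
```

Only graphs that appear in the inverted lists of the query's tokens are ever scored, which is how the exhaustive pass over the corpus is avoided. In a fuller pipeline the shortlist would presumably be rescored with the soft set containment score mentioned above; that step is omitted here.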