We study the problem of finding maximal exact matches (MEMs) between a query string $Q$ and a labeled graph $G$. MEMs are an important class of seeds, often used in practical seed-chain-extend alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $\kappa$ ($\kappa$-MEMs). However, on arbitrary input graphs, and even on acyclic ones, the problem of finding MEMs cannot be solved in truly sub-quadratic time unless the Strong Exponential Time Hypothesis (SETH) fails (Equi et al., ICALP 2019). In this paper, we present an $O(n\cdot L \cdot d^{L-1} + m + M_{\kappa,L})$-time algorithm that finds all $\kappa$-MEMs between $Q$ and $G$ spanning exactly $L$ nodes in $G$, where $n$ is the total length of node labels, $d$ is the maximum degree of a node in $G$, $m = |Q|$, and $M_{\kappa,L}$ is the number of output MEMs. We use this algorithm to develop a $\kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) that runs in $O(nH^2 + m + M_\kappa)$ time, where $H$ is the maximum number of nodes in a block and $M_\kappa$ is the total number of $\kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between $G$ and any of the strings). Additionally, we provide preliminary experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.
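To make the notion of a $\kappa$-MEM concrete, the following minimal sketch enumerates $\kappa$-MEMs in the plain string-to-string case, that is, between a query and a single text rather than a labeled graph. It is a quadratic brute-force illustration of the definition only, not the graph algorithm described above; the function name `kappa_mems` and the 0-based `(i, j, length)` output format are assumptions made for this example.

```python
def kappa_mems(q: str, t: str, kappa: int = 1):
    """Return all (i, j, length) such that q[i:i+length] == t[j:j+length],
    the match cannot be extended to the left or to the right,
    and length >= kappa. Positions are 0-based."""
    m, n = len(q), len(t)
    # lcp[i][j] = length of the longest common prefix of q[i:] and t[j:].
    # The extra row and column of zeros handle the string ends.
    lcp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if q[i] == t[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1

    mems = []
    for i in range(m):
        for j in range(n):
            length = lcp[i][j]
            if length == 0 or length < kappa:
                continue
            # Right-maximality holds by construction, since lcp is the
            # longest possible extension. Left-maximality needs a check:
            # either a string starts here or the preceding characters differ.
            if i == 0 or j == 0 or q[i - 1] != t[j - 1]:
                mems.append((i, j, length))
    return mems
```

For instance, `kappa_mems("abcab", "cabc", 2)` reports the two matches `"abc"` and `"cab"`, while raising the threshold to `kappa = 4` leaves no output, which is the filtering effect that $\kappa$ is meant to have on the number of seeds passed to chaining.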