Elastic Founder Graphs Improved and Enhanced

From arXiv: 47 pages, 10 figures. Extension of the conference papers IWOCA 2022 (https://doi.org/10.1007/978-3-031-06678-8_35, preprint arXiv:2201.06492) and CPM 2022 (https://doi.org/10.4230/LIPIcs.CPM.2022.19), and of some results from the PhD dissertation projects of Massimo Equi (http://urn.fi/URN:ISBN:978-951-51-8217-3) and Tuukka Norri (http://urn.fi/URN:ISBN:978-951-51-8215-9).

Indexing labeled graphs for pattern matching is a central challenge of pangenomics. Equi et al. (Algorithmica, 2022) developed the Elastic Founder Graph ($\mathsf{EFG}$), which represents an alignment of $m$ sequences of length $n$ drawn from alphabet $\Sigma$ plus the special gap character: its paths spell the original sequences or their recombinations. By enforcing the semi-repeat-free property, the $\mathsf{EFG}$ admits a polynomial-space index for linear-time pattern matching, breaking through the conditional lower bounds on indexing labeled graphs (Equi et al., SOFSEM 2021). In this work we reduce the space of the $\mathsf{EFG}$ index that answers pattern matching queries in linear time, from linear in the length of all strings spelled by three consecutive node labels to linear in the size of the edge labels. We then develop linear-time construction algorithms optimizing for different metrics: we improve the existing linearithmic construction algorithms to $O(mn)$ time by solving the novel exclusive ancestor set problem on trees, and we propose, for the simplified gapless setting, an $O(mn)$-time solution minimizing the maximum block height, which we generalize by substituting block height with prefix-aware height. Finally, to show the versatility of the framework, we develop a BWT-based $\mathsf{EFG}$ index and study how to encode a set of paths of the graph and perform document listing queries on it, reporting which paths contain a given pattern as a substring. We propose the $\mathsf{EFG}$ framework as an improved and enhanced version of the framework for the gapless setting, along with construction methods that are valid in any setting concerned with the segmentation of aligned sequences.
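
To make the objects in the abstract concrete, the following is a minimal Python sketch of how a founder graph can be derived from a segmentation of aligned rows, and of why its paths spell the original rows plus recombinations. It is an illustration under stated assumptions, not the construction algorithm of the paper: the segmentation is given by hand rather than optimized, the helper names (`build_founder_graph`, `spelled_paths`, `max_block_height`) are hypothetical, the sketch assumes no row consists only of gaps inside a segment, and it does not check the semi-repeat-free property or build any index.

```python
GAP = "-"

def build_founder_graph(rows, boundaries):
    """Return (blocks, edges) for aligned rows cut at the given column boundaries.

    boundaries is a sorted list [0, ..., n]; segment i covers the columns
    boundaries[i] .. boundaries[i + 1] - 1. All names here are illustrative.
    """
    segments = list(zip(boundaries, boundaries[1:]))
    # Label of each row inside each segment, with gap characters removed.
    labels = [[row[a:b].replace(GAP, "") for a, b in segments] for row in rows]
    # Assumption of this sketch: every row contributes a non-empty label.
    if any(lab == "" for row_labels in labels for lab in row_labels):
        raise ValueError("a row consists only of gaps inside some segment")
    # A block holds the distinct labels observed in one segment.
    blocks = [sorted({row_labels[i] for row_labels in labels})
              for i in range(len(segments))]
    # Edge (i, u, v): some row has label u in segment i and label v in segment i + 1.
    edges = {(i, row_labels[i], row_labels[i + 1])
             for row_labels in labels
             for i in range(len(segments) - 1)}
    return blocks, edges

def spelled_paths(blocks, edges):
    """Enumerate the strings spelled by paths from the first to the last block."""
    paths = [[u] for u in blocks[0]]
    for i in range(len(blocks) - 1):
        paths = [p + [v] for p in paths for v in blocks[i + 1]
                 if (i, p[-1], v) in edges]
    return sorted({"".join(p) for p in paths})

def max_block_height(blocks):
    """Largest number of distinct labels in any block."""
    return max(len(block) for block in blocks)

# Toy gapless alignment: m = 3 rows of length n = 6, cut into three segments.
rows = ["AACCGG", "AATCGG", "GATCGA"]
blocks, edges = build_founder_graph(rows, [0, 2, 4, 6])
spelled = spelled_paths(blocks, edges)

print(blocks)
print(spelled)
print(max_block_height(blocks))
```

In this toy input every original row is spelled by some path, and strings such as `AATCGA` appear as recombinations of rows 2 and 3. The maximum block height is the quantity that one of the $O(mn)$-time constructions in the gapless setting minimizes over valid segmentations; here it is only evaluated for one fixed segmentation.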
