Sequence alignment is a cornerstone technique in computational biology for assessing similarities and differences among biological sequences. A key variant, sequence-to-graph alignment, plays a crucial role in capturing genetic variation. In this work, we introduce two novel formulations within this framework, the Gap-sensitive Co-Linear Chaining (Gap-CLC) problem and the Co-Linear Chaining with Errors based on Edit Distance (Edit-CLC) problem, and we investigate their computational complexity. We show that the Gap-CLC problem cannot be solved in sub-quadratic time unless the Strong Exponential Time Hypothesis (SETH) fails, even when restricted to binary alphabets. Furthermore, we establish that the Edit-CLC problem is NP-hard in the presence of errors within the pan-genome graph. These findings show that incorporating co-linear structure into sequence-to-graph alignment models does not reduce their computational complexity: the resulting problems remain at least as hard to solve as those that lack such prior information.