Revisiting $O(n \log \log n)$ chaining for anchored edit distance

Colinear chaining is a classical heuristic for sequence alignment: it enables scalable genome comparison and is a main component of many state-of-the-art read mappers based on seed-chain-extend. The earliest $O(n \log \log n)$-time algorithms, by Eppstein et al. (J. ACM, 1992), chained $n$ fragments between two sequences $T$ and $Q$ while minimizing a gap cost based on the diagonal distance $\Delta_{\text{diag}}$ between consecutive fragments. However, they forbade fragment overlaps, which are essential in current chaining formulations: in long-read mapping, allowing overlaps improves sensitivity and avoids restrictions on the class of fragments considered. Jain, Gibney, and Thankachan (J. Comput. Biol., 2022) recently combined a $\Delta_{\text{diag}} = |\Delta_T - \Delta_Q|$ overlap cost with the classic $L_\infty = \max(\Delta_T, \Delta_Q)$ gap cost, which takes the maximum of the horizontal and vertical gaps between the fragments, and proved that chaining under this cost model is equivalent to computing the anchored edit distance. We improve the existing $O(n \log^3 n)$-time algorithm for anchored edit distance to $O(n \log \log n)$ time and $O(n)$ space by combining the gap-cost computation of Chao and Miller (Algorithmica, 1995) with the overlap-cost computation of Baker and Giancarlo (ESA, 1998). By developing llchain, a simpler $O(n \log n)$-time implementation of our method, we show that chaining algorithms that may have been overlooked recently by the bioinformatics community scale competitively to millions of fragments and large genomes. On average, llchain is $10\times$ faster than other methods on instances with $3\,000\,000$ anchors, and over $3\times$ faster on MEMs between HiFi reads and a human reference genome.
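
To make the cost model concrete, the following is a minimal quadratic-time sketch of chaining with an $L_\infty$ gap cost and a diagonal-distance overlap cost. It is only an illustrative reference under stated assumptions, not the $O(n \log \log n)$ algorithm and not the llchain implementation: the anchor representation (half-open exact matches of equal length in $T$ and $Q$), the dummy start and end anchors, the precedence rule, and the names `connect`, `precedes`, and `chain_cost` are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed anchor format: (start in T, start in Q, length), covering the
# half-open intervals [t, t + length) in T and [q, q + length) in Q.
Anchor = Tuple[int, int, int]


def connect(a: Anchor, b: Anchor) -> int:
    """Cost of placing anchor b right after anchor a in a chain."""
    (at, aq, al), (bt, bq, _) = a, b
    gap_t = max(0, bt - (at + al))      # uncovered characters of T between a and b
    gap_q = max(0, bq - (aq + al))      # uncovered characters of Q between a and b
    ov_t = max(0, (at + al) - bt)       # overlap of a and b in T
    ov_q = max(0, (aq + al) - bq)       # overlap of a and b in Q
    # L_inf gap cost plus diagonal-distance overlap cost.
    return max(gap_t, gap_q) + abs(ov_t - ov_q)


def precedes(a: Anchor, b: Anchor) -> bool:
    """Assumed precedence: starts and ends strictly increase in T and in Q."""
    (at, aq, al), (bt, bq, bl) = a, b
    return at < bt and aq < bq and at + al < bt + bl and aq + al < bq + bl


def chain_cost(anchors: List[Anchor], len_t: int, len_q: int) -> int:
    """Minimum chain cost from a dummy start to a dummy end, in O(n^2) time."""
    start = (-1, -1, 1)           # dummy anchor ending just before position 0
    end = (len_t, len_q, 1)       # dummy anchor starting just after both sequences
    pts = [start] + sorted(anchors) + [end]
    inf = float("inf")
    best = [inf] * len(pts)
    best[0] = 0
    # Sorting by start in T guarantees every predecessor has a smaller index.
    for i in range(1, len(pts)):
        for j in range(i):
            if best[j] < inf and precedes(pts[j], pts[i]):
                best[i] = min(best[i], best[j] + connect(pts[j], pts[i]))
    return best[-1]
```

For instance, with $T = \texttt{ACGT}$, $Q = \texttt{ACT}$, and the two anchors `(0, 0, 2)` and `(3, 2, 1)`, `chain_cost` returns 1, which corresponds to the single deleted character between the anchors. The double loop is what faster algorithms avoid by answering the minimization over predecessors with suitable data structures.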
