Pairwise sequence alignment with block and character edit operations

Pairwise sequence comparison is one of the most fundamental problems in string processing. The most common metric for quantifying the similarity between sequences S and T is the edit distance, d(S,T): the minimum number of characters that must be substituted in, deleted from, or inserted into S to generate T. However, some string pairs can be transformed into one another with fewer edit operations if larger rearrangements are permitted. Block edit distance captures such changes at the substring (i.e., block) level: it penalizes the removal, insertion, copy, or reversal of an entire block at the same cost as a single-character edit (Lopresti & Tomkins, 1997). To date, most studies of block edit distance have aimed only to characterize the distance itself, for applications in sequence nearest neighbor search, and do not report full alignment details. A few tools, such as GR-Aligner, attempt to solve block edit distance for genomic sequences, but they have limited functionality and are no longer maintained. Here, we present SABER, an algorithm for solving block edit distance that supports block deletions, block moves, and block reversals in addition to the classical single-character edit operations. Our algorithm runs in O(m^2 · n · l_range) time, where |S| = m, |T| = n, and l_range is the permitted range of block sizes, and it can report all breakpoints of the block operations. We also provide an implementation of SABER that is currently optimized for genomic sequences (i.e., sequences over the DNA alphabet), although the algorithm can in principle be used with any alphabet. SABER is available at http://github.com/BilkentCompGen/saber
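
To make the gap between character-level and block-level costs concrete, the following minimal sketch computes the classical edit distance d(S,T) with the standard dynamic program and compares it against a single block reversal on a toy pair. It is not SABER's algorithm. The function names `edit_distance` and `reverse_block` and the example strings are hypothetical, and the reversal here is a plain string reversal (whether a genomic tool would use the reverse complement instead is not assumed).

```python
def edit_distance(s: str, t: str) -> int:
    """Classical unit-cost edit distance d(S, T) (substitute, delete, insert)."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete s[i-1]
                curr[j - 1] + 1,     # insert t[j-1]
                prev[j - 1] + cost,  # match or substitute
            )
        prev = curr
    return prev[n]


def reverse_block(s: str, start: int, end: int) -> str:
    """Return s with the block s[start:end] reversed (illustrative helper)."""
    return s[:start] + s[start:end][::-1] + s[end:]


# Toy example (hypothetical sequences): T is S with the block "ACGT" reversed.
S = "GGACGTGG"
T = "GGTGCAGG"

char_cost = edit_distance(S, T)       # 4 single-character edits
one_block = reverse_block(S, 2, 6)    # a single block operation
print(char_cost, one_block == T)      # prints: 4 True
```

In this toy pair, four single-character edits are needed, whereas one block reversal (with breakpoints at positions 2 and 6) suffices, which is the kind of saving that block edit distance is meant to capture.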
