Pairwise sequence comparison is one of the most fundamental problems in string processing. The most common metric for quantifying the similarity between sequences S and T is the edit distance, d(S,T), which is the minimum number of characters that must be substituted, deleted from, or inserted into S to generate T. However, for some string pairs, fewer edit operations suffice to transform one string into the other if larger rearrangements are permitted. Block edit distance captures such changes at the substring (i.e., block) level: it "penalizes" entire block removals, insertions, copies, and reversals with the same cost as single-character edits (Lopresti & Tomkins, 1997). Most studies on block edit distance to date have aimed only to characterize the distance itself, for applications in sequence nearest neighbor search, without reporting the full alignment details. Although a few tools, such as GR-Aligner, try to solve block edit distance for genomic sequences, they have limited functionality and are no longer maintained. Here, we present SABER, an algorithm to solve block edit distance that supports block deletions, block moves, and block reversals in addition to the classical single-character edit operations. Our algorithm runs in O(m^2 · n · l_range) time, where |S| = m, |T| = n, and l_range is the permitted range of block sizes, and it can report all breakpoints for the block operations. We also provide an implementation of SABER that is currently optimized for genomic sequences (i.e., sequences over the DNA alphabet), although the algorithm can in theory be used with any alphabet. SABER is available at http://github.com/BilkentCompGen/saber
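For readers unfamiliar with the baseline metric, the classical edit distance d(S,T) described above can be sketched with the standard dynamic-programming recurrence. This is only an illustration of the single-character case (unit-cost substitutions, deletions, and insertions); it is not the SABER algorithm, it does not handle block operations, and the function name `edit_distance` is a hypothetical choice made for this sketch.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical (Levenshtein) edit distance with unit costs.

    Illustrative sketch only: covers single-character substitution,
    deletion, and insertion, not the block operations SABER supports.
    """
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t; row 0 is "insert j characters".
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        # Column 0: delete all i characters of s seen so far.
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs from s
                curr[j - 1] + 1,           # insert ct into s
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The sketch keeps only two rows of the table, so it uses O(n) memory and O(m·n) time for |S|=m and |T|=n. Under this metric, a rearranged substring is paid for character by character, which is the cost that block operations are meant to avoid.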