Sequence alignment is widely used in many fields to determine how closely two sequences are related and, at times, how little they differ. In computational biology and bioinformatics, many algorithms have been developed over time not only to align two sequences quickly but also to produce alignments that yield good laboratory results. The first algorithms were based on a technique called dynamic programming; they were very slow but optimal in terms of sensitivity. To improve speed, more algorithms today are based on heuristic approaches, which sacrifice some sensitivity. In this paper, we improve on MASAA (Multiple Anchor Staged Local Sequence Alignment Algorithm) and MASAA Sensitive, heuristic algorithms that we published previously. The new algorithm is called Maximum Match Subsequence Alignment Algorithm Finely Grained. Like our previous algorithms, it is based on the suffix tree data structure, but to improve sensitivity we employ adaptive seeds and finely grained perfect-match seeds between the already identified anchors. We tested the algorithm on randomly generated sequences and on the Rosetta dataset, where sequence lengths ranged up to 500 thousand.
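To make the dynamic programming baseline mentioned above concrete, the following is a minimal sketch of Smith-Waterman local alignment scoring. It is not the MASAA method or the algorithm proposed in this paper; the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen only for illustration. The nested loops fill one cell per pair of positions, so the running time grows with the product of the two sequence lengths, which is why such methods become slow for sequences of several hundred thousand characters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    Uses a linear gap penalty and keeps only two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)  # scores for row i-1
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)  # scores for row i
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared substring "ACG" gives 3 matches under the assumed scoring.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Heuristic aligners avoid filling this full table by first locating exact matches (seeds or anchors) and aligning only the regions around or between them.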