Motif finding is an important step in detecting rare events that occur in a set of DNA or protein sequences. Extracting information about these rare events can lead to new biological discoveries. Motifs are patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Although several flavors of motif searching algorithms have been studied in the literature, we study the version known as $ (l, d) $-motif search or Planted Motif Search (PMS). In PMS, given two integers $ l $ and $ d $ and $ n $ input sequences, the goal is to find all patterns of length $ l $ that appear in each of the $ n $ input sequences with at most $ d $ mismatches. We also discuss the quorum version of PMS (qPMS), which finds motifs that need not be planted in all the input sequences but occur in at least $ q $ of them. Our algorithm is mainly based on the algorithms qPMSPrune, qPMS7, TraverStringRef and PMS8. We introduce techniques to compress the input strings and to compare strings faster using bitwise operations. Our algorithm performs slightly better than the existing exact algorithms for solving the qPMS problem on DNA sequences. We also propose an idea for a parallel implementation of our algorithm.
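To make the qPMS definition and the idea of bitwise string comparison concrete, the following is a minimal Python sketch. It is not the algorithm of this work: the 2-bit encoding, the XOR-and-popcount mismatch count, and the exhaustive search over all $ 4^l $ candidates are illustrative assumptions, and every name in it (`CODE`, `pack`, `mismatches`, `packed_lmers`, `qpms_bruteforce`) is hypothetical. The pruning strategies of qPMSPrune, qPMS7, TraverStringRef and PMS8 are deliberately left out.

```python
from itertools import product

# Assumed 2-bit encoding of the DNA alphabet (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(s):
    """Pack a DNA string into an int, 2 bits per base, first base in the high bits."""
    v = 0
    for ch in s:
        v = (v << 2) | CODE[ch]
    return v


def mismatches(x, y, l):
    """Count mismatching bases between two packed l-mers (requires l >= 1)."""
    diff = x ^ y                       # non-zero 2-bit pair <=> bases differ
    low_mask = int("01" * l, 2)        # selects the low bit of every pair
    per_base = (diff | (diff >> 1)) & low_mask  # one set bit per differing base
    return bin(per_base).count("1")


def packed_lmers(seq, l):
    """Return every length-l window of seq in packed form, using a rolling update."""
    mask = (1 << (2 * l)) - 1
    v = 0
    out = []
    for i, ch in enumerate(seq):
        v = ((v << 2) | CODE[ch]) & mask
        if i >= l - 1:
            out.append(v)
    return out


def qpms_bruteforce(seqs, l, d, q):
    """Return all l-mers occurring with at most d mismatches in at least q sequences.

    Exhaustive over 4**l candidates, so only practical for very small l.
    """
    windows = [set(packed_lmers(s, l)) for s in seqs]
    motifs = []
    for cand in product("ACGT", repeat=l):
        m = pack(cand)
        hits = sum(
            any(mismatches(m, w, l) <= d for w in ws) for ws in windows
        )
        if hits >= q:
            motifs.append("".join(cand))
    return motifs
```

For example, `qpms_bruteforce(["ACGTAC", "TCGTTT", "GGGGGG"], 3, 0, 2)` returns only `CGT`, the one 3-mer that occurs exactly in at least two of the three sequences. Setting $ q = n $ recovers plain PMS. Packing two bits per base lets a single XOR compare many bases at once, which is the general kind of saving the compression step aims for.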