In ongoing work to define a principled method for discovering and structuring syntenic blocks, based on homology-derived constraints and a generalization of common intervals, we faced a fundamental computational problem: how to quickly determine, among a set of indeterminate strings (strings whose elements are subsets of characters), contiguous intervals that share the vast majority of their elements, while allowing for shared subsets of characters that are subsumed by others and for certain elements to be missing from certain genomes. An algorithm for this problem in the special case of determinate strings (where each element is a single character of the alphabet, i.e., "normal" strings) was described by Doerr et al., but its running time would explode if generalized to indeterminate strings. In this paper, we describe an algorithm that computes these special common intervals in time close to that of the simpler algorithm of Doerr et al., and we show that it can compute these intervals in just a couple of hours for large collections (tens to hundreds) of bacterial genomes.