DNA sequencing technologies have evolved significantly in recent years, enabling the sequencing of a large number of genomes in a short time. Nevertheless, the underlying computational problem is hard, and many technical factors and limitations complicate obtaining the complete sequence of a genome. Many genomes are left in a draft state, in which each chromosome is represented by a set of sequences with only partial information on their relative order. Recently, some approaches have been proposed to compare draft genomes by comparing paths in de Bruijn graphs, which many practical genome assemblers construct. In this article we introduce gcBB, a method for directly comparing genomes represented as succinct colored de Bruijn graphs, without resorting to sequence alignments, by means of the entropy and expectation measures based on the Burrows-Wheeler Similarity Distribution. We also introduce a variant of gcBB, called mgcBB, that considerably improves running time through the selection of different data structures. We have compared phylogenies of genomes obtained by other methods to those obtained with gcBB, achieving promising results.