De Bruijn graphs are essential for sequencing data analysis, and large-scale population studies require that they be constructed and stored efficiently. They must also be dynamic, supporting updates such as the insertion and removal of edges and nodes. Existing dynamic implementations include DynamicBOSS and dynamicDBG. In 2018, Tim Kraska and Alex Beutel introduced a new family of data structures called learned indexes, and in 2020 Paolo Ferragina and Giorgio Vinciguerra proposed a particularly efficient implementation. This paper presents a new method for implementing de Bruijn graphs using learned indexes and compares its performance with that of current implementations. The new method improves time and memory efficiency for edge and node insertions, particularly on large datasets (over 110 million k-mers).