Identifying homologous gene families across multiple genomes is a central task in bacterial pangenomics, and it traditionally requires computationally demanding all-against-all comparisons. PanDelos addresses this challenge with an alignment-free, parameter-free approach based on k-mer profiles, combining high speed and ease of use with accuracy competitive with state-of-the-art methods. However, the increasing availability of genomic data requires tools that scale efficiently to larger datasets. To address this need, we present PanDelos-plus, a fully parallel, gene-centric redesign of PanDelos. The algorithm parallelizes the most computationally intensive phases (Best Hit detection and Bidirectional Best Hit extraction) through data decomposition and a thread pool strategy, while employing lightweight data structures to reduce memory usage. Benchmarks on synthetic datasets show that PanDelos-plus achieves up to 14x faster execution and reduces memory usage by up to 96%, while maintaining accuracy. These improvements allow population-scale comparative genomics to run on standard multicore workstations, making large-scale bacterial pangenome analysis accessible for routine research use.
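As an illustration of the two parallelized phases named above, the following is a minimal sketch of Best Hit detection and Bidirectional Best Hit extraction over k-mer profiles, with the per-gene work decomposed across a thread pool. It is not the PanDelos-plus implementation: the function names, the fixed `k`, the use of a generalized Jaccard similarity on k-mer multisets, and the first-seen tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (a k-mer multiset)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def similarity(p, q):
    """Generalized Jaccard: sum of min counts over sum of max counts."""
    union = sum((p | q).values())
    return sum((p & q).values()) / union if union else 0.0


def best_hits(queries, targets, workers):
    """Map each query gene to its most similar target gene, or None.

    Each query gene is an independent task, so the work is split
    gene by gene across the pool (hypothetical decomposition).
    """
    def best_for(name):
        score, hit = max(
            ((similarity(queries[name], prof), t) for t, prof in targets.items()),
            key=lambda pair: pair[0],   # ties keep the first target seen
            default=(0.0, None),
        )
        return name, (hit if score > 0 else None)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(best_for, queries))


def bidirectional_best_hits(genome_a, genome_b, k=3, workers=4):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    prof_a = {g: kmer_profile(s, k) for g, s in genome_a.items()}
    prof_b = {g: kmer_profile(s, k) for g, s in genome_b.items()}
    a_to_b = best_hits(prof_a, prof_b, workers)
    b_to_a = best_hits(prof_b, prof_a, workers)
    return sorted(
        (a, b) for a, b in a_to_b.items()
        if b is not None and b_to_a.get(b) == a
    )


if __name__ == "__main__":
    # Toy protein sequences, invented for the example.
    genome_a = {"a1": "MKTAYIAKQR", "a2": "GGSWLLPPNN"}
    genome_b = {"b1": "MKTAYIAKQK", "b2": "GGSWLLPPNA", "b3": "QQQQQQQQ"}
    print(bidirectional_best_hits(genome_a, genome_b))
```

On the toy input, `a1` pairs with `b1` and `a2` with `b2`, while `b3` shares no k-mer with any gene and is left unpaired. In standard CPython, threads do not run pure-Python computation in parallel, so this sketch shows only the decomposition pattern; it should not be expected to reproduce the reported speedups.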