Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Because this step is computationally expensive, new and more optimized exact and heuristic algorithms are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis of them has been performed, which risks falsely discarding genetic sequences from further analysis and unnecessarily inflating computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences and inform readers about their pros and cons through a thorough experimental evaluation on different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at: https://github.com/CMU-SAFARI/SequenceLab