Computational complexity is a key limitation of genomic analyses. To address it, researchers have proposed numerous fast heuristic methods over the last 30 years. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Because this step is computationally expensive, optimized exact and heuristic algorithms for it are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis of this sensitivity has been performed, so these methods may falsely discard genetic sequences from further analysis and unnecessarily inflate computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about the advantages and downsides of these methods through a thorough experimental evaluation on different real datasets from all major sequencing manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab.