The COVID-19 pandemic has prompted an unprecedented worldwide effort to characterize the evolution of the coronavirus SARS-CoV-2 by mapping its mutations. The early identification of mutations that could confer adaptive advantages to the virus, such as higher infectivity or immune evasion, is of paramount importance. However, the large number of currently available genomes precludes the efficient use of phylogeny-based methods. Here we present CoVtRec, a fast and scalable Topological Data Analysis approach for the surveillance of emerging adaptive mutations in large genomic datasets. Our method overcomes limitations of state-of-the-art phylogeny-based approaches by quantifying the potential adaptiveness of mutations solely from their topological footprint in the genome alignment, without resorting to the reconstruction of a single optimal phylogenetic tree. Analyzing millions of SARS-CoV-2 genomes from GISAID, we find a correlation between topological signals and adaptation to the human host. By leveraging the stratification by time in sequence data, our method enables high-resolution longitudinal analysis of topological signals of adaptation. We characterize the convergent evolution of the coronavirus throughout the whole pandemic to date, report on emerging potentially adaptive mutations, and pinpoint mutations in Variants of Concern that are likely associated with positive selection. Our approach can improve the surveillance of mutations of concern and guide experimental studies.