Advances in high-throughput sequencing technologies have produced a large number of genomes for analysis, making computational methods crucial for extracting knowledge from the generated data. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is commonly used to analyze genomic variation; however, this approach can be computationally expensive and potentially arbitrary for large datasets. Here, we present GRAMEP, a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. The method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. This informative k-mer set enables the detection of variant-specific mutations relative to a reference sequence. In addition, the method can classify novel sequences without requiring organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. It maintained accurate SARS-CoV-2 variant identification at a lower computational cost than the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
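To make the idea of entropy-based k-mer selection concrete, the following is a minimal sketch and not the GRAMEP implementation. It assumes one plausible reading of the approach: k-mer frequencies in the variant sequences are split by a Kapur-style maximum-entropy threshold, and the k-mers above the threshold that are absent from the reference are kept as variant-specific. The function names (`count_kmers`, `max_entropy_threshold`, `informative_kmers`) and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def _entropy(weights):
    """Shannon entropy (in nats) of non-negative weights after normalising."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum((w / total) * log(w / total) for w in weights if w > 0)


def max_entropy_threshold(counts):
    """Kapur-style threshold on the histogram of k-mer frequencies.

    Returns the frequency t that maximises
    H(frequencies <= t) + H(frequencies > t).
    This thresholding rule is an assumption made for illustration.
    """
    if not counts:
        return 0
    hist = Counter(counts.values())  # frequency value -> number of k-mers
    values = sorted(hist)
    best_t, best_h = values[0], float("-inf")
    for idx in range(len(values) - 1):
        low = [hist[v] for v in values[:idx + 1]]
        high = [hist[v] for v in values[idx + 1:]]
        h = _entropy(low) + _entropy(high)
        if h > best_h:
            best_t, best_h = values[idx], h
    return best_t


def informative_kmers(variant_seqs, reference, k):
    """Frequent variant k-mers that never occur in the reference."""
    counts = count_kmers(variant_seqs, k)
    t = max_entropy_threshold(counts)
    frequent = {kmer for kmer, c in counts.items() if c > t}
    return frequent - set(count_kmers([reference], k))


# Toy data (hypothetical): four sequences share a T->A substitution,
# and one of them carries an additional rare change.
reference = "ACGTTGCA"
variants = ["ACGATGCA"] * 4 + ["ACGATGGA"]
selected = informative_kmers(variants, reference, k=3)
```

In this toy case the selected k-mers are the ones overlapping the shared substitution, while k-mers arising from the rare change fall below the threshold. Locating the exact mutated position would additionally require mapping the selected k-mers back onto the reference, which is omitted here.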