This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (here applied to two-dimensional CGR images) to cluster datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in sequence length, number of genomes, number of clusters, and taxonomic level, demonstrates its robustness, scalability, and versatility.
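To make the input representation concrete, the following is a minimal sketch of how a DNA sequence can be turned into a two-dimensional CGR image. It is an illustration under stated assumptions, not the CGRclust implementation: the corner assignment for A, C, G, T, the grid resolution parameter `k`, the choice to skip ambiguous symbols, and the function names `cgr_points` and `fcgr` are all hypothetical choices made here.

```python
import numpy as np

# Assumed corner layout of the unit square (a common convention, not taken
# from the source): each nucleotide owns one corner.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos-game trajectory of `seq` as an (n, 2) array.

    Starting from the centre of the unit square, each nucleotide moves the
    current point halfway towards that nucleotide's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Assumption: symbols other than A/C/G/T (e.g. N) are skipped.
            continue
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        points.append((x, y))
    return np.array(points, dtype=float).reshape(-1, 2)


def fcgr(seq, k=6):
    """Bin the CGR trajectory into a 2^k x 2^k grid of visit counts."""
    size = 2 ** k
    image = np.zeros((size, size), dtype=np.int64)
    pts = cgr_points(seq)
    if len(pts) == 0:
        return image
    # Clamp guards against a coordinate rounding up to exactly 1.0.
    cols = np.minimum((pts[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((pts[:, 1] * size).astype(int), size - 1)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(image, (rows, cols), 1)
    return image
```

Calling `fcgr(sequence, k=6)` yields a 64 x 64 integer array that can be normalised and passed to an image model such as a CNN. Because the image size depends only on `k`, sequences of very different lengths map to inputs of the same shape, which is one reason an alignment-free representation of this kind suits datasets with widely varying sequence lengths.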