Viruses are the most abundant biological entities on Earth and play a pivotal role in microbial ecosystems; as prominent human pathogens, they are also closely linked to human morbidity and mortality. Accurate identification of viruses from their genome sequences is therefore essential, but existing genome-based classification models, which rely largely on composition- or frequency-based subsequence features, often suffer from limited interpretability and reduced accuracy, particularly on complex or imbalanced datasets. To address these limitations, we propose GeneNSPCla (Genomic Negative Sequential Pattern-based Classification), a novel viral classification framework based on Negative Sequential Patterns (NSPs) that extracts discriminative absence-based features from the nucleotide sequences of RNA viral genomes. By transforming these NSPs into numerical feature vectors and feeding them to multiple supervised classifiers, GeneNSPCla captures both presence and absence signals in viral sequences. Furthermore, we propose GONPM+, a negative pattern mining algorithm adapted to genomic data that can discover longer and more biologically meaningful negative sequential patterns. Experimental results show that, averaged over 8 classifiers, GONPM+ improves accuracy by 10.03% over the original negative pattern mining algorithm and by 24.75% over the positive pattern mining algorithm. These findings highlight the effectiveness of incorporating absence-based sequential information and provide a new, complementary perspective for viral genome analysis and classification.
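To make the idea of turning NSPs into numerical feature vectors concrete, the following is a minimal sketch and not the GeneNSPCla or GONPM+ implementation. It assumes a simplified containment rule that the abstract does not specify: a pattern is an ordered list of `(nucleotide, is_negated)` items, positive items must appear in order as a subsequence, and a negated item must be absent from the gap between its neighbouring positive matches. The names `matches` and `featurize` are hypothetical, and the backtracking search is only intended for short illustrative sequences.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: (nucleotide, is_negated)
Item = Tuple[str, bool]


def matches(seq: str, pattern: List[Item], start: int = 0,
            banned: FrozenSet[str] = frozenset()) -> bool:
    """Return True if `seq` contains `pattern` under the simplified rule.

    `banned` holds the negated symbols that must not appear between the
    previous positive match and the next one (assumed semantics).
    """
    if not pattern:
        # Trailing negated items must be absent from the rest of the sequence.
        return not any(c in banned for c in seq[start:])
    (symbol, negated), rest = pattern[0], pattern[1:]
    if negated:
        # Remember the symbol; it is checked while scanning the next gap.
        return matches(seq, rest, start, banned | {symbol})
    for pos in range(start, len(seq)):
        # Try to match the positive item here, then match the remainder.
        if seq[pos] == symbol and matches(seq, rest, pos + 1):
            return True
        # A banned symbol in the gap rules out every later position.
        if seq[pos] in banned:
            return False
    return False


def featurize(sequences: List[str], patterns: List[List[Item]]) -> List[List[int]]:
    """Build one binary feature per pattern: 1 if the sequence matches it."""
    return [[int(matches(s, p)) for p in patterns] for s in sequences]


patterns = [
    [("A", False), ("C", True), ("G", False)],  # A ... G with no C in between
    [("G", False), ("T", False)],               # ordinary positive pattern G ... T
]
sequences = ["ATTG", "ACG", "GCAT"]
features = featurize(sequences, patterns)
print(features)
```

Each row of `features` corresponds to one sequence and each column to one pattern, so a matrix of this shape could be passed, together with class labels, to any standard supervised classifier.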