Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes apply a simple sliding window to produce tokens such as k-mers, which do not leverage any gene-based semantics. As a result, masking may fall on trivially predictable sequences, which makes Masked Language Modeling (MLM) training inefficient. We therefore propose GeneMask, a novel masking algorithm for MLM training on gene sequences: we randomly select positions in a gene sequence as mask centers and, around each mask center, locally select the span with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that, in the absence of human-understandable semantics in the genomics domain (in contrast to NLP, where semantic units like words and phrases are inherently available), GeneMask-based models substantially outperform the SOTA models (DNABert and LOGO) on four benchmark gene sequence classification datasets in five few-shot settings (10-shot to 1000-shot). More significantly, the GeneMask-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are publicly available at https://github.com/roysoumya/GeneMask.
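To make the masking idea concrete, the following is a minimal sketch of NPMI-guided mask selection. It is not the released GeneMask implementation. Several details are assumptions made for illustration only, because the abstract does not specify them: each k-mer is scored by the standard two-event NPMI between its first and second half, one token is chosen per mask center rather than a multi-token span, and the names `npmi_table` and `select_mask_starts` and the parameters `k`, `mask_prob`, and `window` are hypothetical.

```python
import math
import random
from collections import Counter


def npmi_table(sequences, k=6):
    """Score every k-mer by NPMI between its two halves.

    Assumption: NPMI(x, y) = PMI(x, y) / -log p(x, y), where x and y are
    the first and second half of the k-mer. The paper may define the
    score differently.
    """
    half = k // 2
    pair_counts, left_counts, right_counts = Counter(), Counter(), Counter()
    for seq in sequences:
        # Sliding window with stride 1, as in standard k-mer tokenization.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            pair_counts[kmer] += 1
            left_counts[kmer[:half]] += 1
            right_counts[kmer[half:]] += 1
    total = sum(pair_counts.values())
    scores = {}
    for kmer, count in pair_counts.items():
        p_xy = count / total
        if p_xy == 1.0:
            # Degenerate corpus with a single k-mer type: avoid dividing by log(1) = 0.
            scores[kmer] = 1.0
            continue
        p_x = left_counts[kmer[:half]] / total
        p_y = right_counts[kmer[half:]] / total
        pmi = math.log(p_xy / (p_x * p_y))
        scores[kmer] = pmi / -math.log(p_xy)
    return scores


def select_mask_starts(seq, scores, k=6, mask_prob=0.15, window=2, rng=None):
    """Pick random mask centers, then mask the best-scoring token near each.

    Returns sorted start indices of the k-mer tokens to mask.
    Assumption: mask_prob and window are illustrative defaults.
    """
    rng = rng or random.Random()
    n_tokens = len(seq) - k + 1
    if n_tokens <= 0:
        return []
    n_centers = min(n_tokens, max(1, round(mask_prob * n_tokens)))
    centers = rng.sample(range(n_tokens), n_centers)
    chosen = set()
    for center in centers:
        lo = max(0, center - window)
        hi = min(n_tokens - 1, center + window)
        # Unseen k-mers get the lowest possible NPMI (-1.0).
        best = max(range(lo, hi + 1),
                   key=lambda i: scores.get(seq[i:i + k], -1.0))
        chosen.add(best)
    return sorted(chosen)
```

In this sketch, the random centers preserve the stochasticity of standard MLM masking, while the local NPMI search shifts each mask toward tokens whose halves co-occur more often than chance, which is the intuition the abstract describes.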