Self-supervised pretraining (SSP) is a recognized way to improve prediction accuracy on downstream tasks, but its efficacy for DNA sequences remains limited. The main reason is that most existing SSP approaches in genomics rely on masked language modeling of individual sequences and neglect to encode statistics across multiple sequences. To overcome this limitation, we introduce a deep neural network model built on collaborative learning between a "student" and a "teacher" subnetwork. The student subnetwork performs masked learning on nucleotides and progressively transfers its parameters to the teacher subnetwork through an exponential moving average. Concurrently, both subnetworks engage in contrastive learning on two augmented representations of the input sequences. This self-distillation process enables the model to assimilate both contextual information within individual sequences and distributional information across the sequence population. We validated the approach by pretraining on the human reference genome and then applying the model to 20 downstream inference tasks. The empirical results show that our method significantly improves inference performance on the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
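As a rough illustration of two ingredients named above, nucleotide masking and the exponential moving average that links the student and teacher parameters, a minimal sketch might look as follows. It is not the authors' implementation: the function names, the momentum of 0.99, the mask rate of 0.15, and the use of "N" as a mask token are all assumptions chosen for illustration, and the update direction (teacher tracking the student) follows common self-distillation practice rather than anything stated here.

```python
import random


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the matching student parameter.

    `momentum` is a hypothetical value; the abstract does not specify one.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def mask_nucleotides(seq, mask_rate=0.15, mask_token="N", rng=None):
    """Replace a random subset of positions with a mask token.

    Returns the masked sequence and the list of masked positions.
    `mask_rate` and `mask_token` are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    positions = [i for i in range(len(seq)) if rng.random() < mask_rate]
    masked = list(seq)
    for i in positions:
        masked[i] = mask_token
    return "".join(masked), positions
```

In a training loop, the student would be optimized to recover the masked nucleotides, and `ema_update` would be called after each optimizer step so that the teacher changes more slowly than the student.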