Predicting transcription factor binding sites is important for understanding how transcription factors regulate gene expression and how this regulation can be modulated for therapeutic purposes. Although significant work has addressed this problem in the past few years, there is still room for improvement. In this regard, a transformer-based capsule network, viz. DNABERT-Cap, is proposed in this work to predict transcription factor binding sites by mining ChIP-seq datasets. DNABERT-Cap is a bidirectional encoder pre-trained on a large number of genomic DNA sequences and augmented with a capsule layer responsible for the final prediction. The proposed model builds a predictor for transcription factor binding sites by jointly optimising features from the bidirectional encoder and the capsule layer, along with convolutional and bidirectional long short-term memory layers. To evaluate the effectiveness of the proposed approach, we use benchmark ChIP-seq datasets for five cell lines, viz. A549, GM12878, Hep-G2, H1-hESC and HeLa, available in the ENCODE repository. The results show that the average area under the receiver operating characteristic curve (AUC) exceeds 0.91 for all five cell lines. DNABERT-Cap is also compared with existing state-of-the-art deep-learning-based predictors, viz. DeepARC, DeepTF, CNN-Zeng and DeepBind, and is seen to outperform them.
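The evaluation metric reported above, the area under the receiver operating characteristic curve, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `roc_auc`, the label convention (1 for a bound sequence, 0 for an unbound one) and the toy scores are assumptions made only for illustration. The sketch uses the standard equivalence between the area under the curve and the normalised Mann-Whitney U statistic, with average ranks for tied scores.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Assumed convention (not from the paper): label 1 marks a positive
    (bound) sequence and label 0 a negative (unbound) one.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Sort sample indices by score, then assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U counts the (positive, negative) pairs in which the positive
    # sample is ranked higher; ties contribute one half.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy data, purely illustrative (not results from the paper).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

An average score per cell line, as reported above, would then be the mean of such values over that cell line's datasets. A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of bound from unbound sequences.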