In recent years, Transformer-based language models have had a considerable impact on the field of natural language processing (NLP). Because relevant parallels can be drawn between biological sequences and natural languages, the models used in NLP can be readily extended and adapted to various applications in bioinformatics. In this regard, this paper introduces the major recent developments in Transformer-based models in the context of nucleotide sequences. We have reviewed and analysed a large number of application-based papers on this subject, highlighting their main characterizing features and the different approaches that may be adopted to customize such powerful computational machines. We have also provided a structured description of how Transformers function, which may enable even first-time users to grasp the essence of such complex architectures. We believe this review will help the scientific community understand the various applications of Transformer-based language models to nucleotide sequences. We also expect this work to motivate readers to build on these methodologies to tackle various other problems in the field of bioinformatics.