Language models, especially transformer-based ones, have achieved remarkable success in natural language processing (NLP): discriminative models such as BERT have advanced natural language understanding, while generative models such as GPT-3 have done the same for natural language generation. If a DNA sequence is viewed as text written in a four-letter alphabet of nucleotides, it is structurally similar to a natural language. This similarity has motivated discriminative language models such as DNABert in DNA-related bioinformatics. To our knowledge, however, the generative side of the coin remains largely unexplored. We therefore developed an autoregressive generative language model, in the spirit of GPT-3, for DNA sequences. Because working with whole DNA sequences is challenging without extensive computational resources, we conducted the study at a smaller scale, focusing on nucleotide sequences of human genes rather than whole DNA. This choice does not alter the structure of the problem: both DNA and genes can be treated as one-dimensional sequences over four nucleotides without losing much information or oversimplifying. First, we systematically studied this almost entirely unexplored problem and found that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. We also learned how to work with generative models on a language that, unlike natural languages, we do not understand, and we noted the importance of evaluating on real-world tasks beyond classical metrics such as perplexity. In addition, we examined whether the data-hungry nature of these models is mitigated by choosing a language with a minimal vocabulary, here four tokens corresponding to the four nucleotide types, on the hypothesis that such a language might make the problem easier. In this study, however, we found that it did not substantially change the amount of data required.
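To make the four-token setting concrete, the sketch below shows a character-level N-gram baseline over the nucleotide alphabet {A, C, G, T}, evaluated by perplexity, in the spirit of the simple techniques mentioned above. The sequences, function names, and add-alpha smoothing choice are illustrative assumptions, not the study's actual setup:

```python
import math
from collections import defaultdict

def train_ngram(seq, n):
    """Count n-gram continuations: context (n-1 chars) -> next-char counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - n + 1):
        context, nxt = seq[i:i + n - 1], seq[i + n - 1]
        counts[context][nxt] += 1
    return counts

def prob(counts, context, nxt, vocab_size=4, alpha=1.0):
    """Add-alpha smoothed conditional probability P(nxt | context)."""
    c = counts[context]
    return (c[nxt] + alpha) / (sum(c.values()) + alpha * vocab_size)

def perplexity(counts, seq, n):
    """Perplexity of a held-out sequence under the n-gram model."""
    log_sum, steps = 0.0, 0
    for i in range(n - 1, len(seq)):
        log_sum += math.log(prob(counts, seq[i - n + 1:i], seq[i]))
        steps += 1
    return math.exp(-log_sum / steps)

# Synthetic training data: a repeating nucleotide motif, not real gene data.
train = "ATGCGATACGATGCGTAGC" * 50
model = train_ngram(train, 3)
print(perplexity(model, "ATGCGATACG", 3))
```

With a four-symbol vocabulary, perplexity is bounded above by roughly the vocabulary size for any reasonable model, which is one reason the abstract argues that task-based evaluation is needed alongside such classical metrics.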