Language models, primarily transformer-based ones, have achieved enormous success in NLP. In particular, BERT for natural language understanding (NLU) and GPT-3 for natural language generation (NLG) have been pivotal. DNA sequences are structurally very close to natural language, and discriminative models such as DNABert already exist in DNA-related bioinformatics. To the best of our knowledge, however, the generative side remains largely unexplored. We therefore focused on developing an autoregressive generative language model, in the style of GPT-3, for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we carried out our study on a smaller scale, focusing on the nucleotide sequences of human genes, which are distinct parts of DNA with specific functions, rather than on whole DNA. This decision did not change the structure of the problem much, because both DNA and genes can be treated as 1D sequences of four different nucleotides without losing much information or oversimplifying. First, we systematically examined an almost entirely unexplored problem and observed that RNNs performed best, while simple techniques such as N-grams were also promising. Second, we learned how to work with generative models on languages that, unlike natural language, we do not understand. Third, we observed how essential it is to evaluate on real-life tasks beyond classical metrics such as perplexity. Finally, we examined whether the data-hungry nature of these models can be reduced by choosing a language with a minimal vocabulary size, here four, corresponding to the four types of nucleotides, since such a language might make the problem easier. However, we observed that this did not change the amount of data needed very much.
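To make the N-gram baseline and the perplexity metric mentioned above concrete, the following is a minimal sketch of a nucleotide-level N-gram model over the four-symbol vocabulary. It is an illustration under stated assumptions, not the setup used in the study: the function names (`train_ngram`, `prob`, `perplexity`, `generate`), the add-alpha smoothing, the order `n=3`, and the toy training sequence are all hypothetical choices made here.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # the four-nucleotide vocabulary


def train_ngram(sequences, n=3):
    """Count next-nucleotide occurrences for every context of length n-1."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            counts[seq[i - n + 1:i]][seq[i]] += 1
    return counts


def prob(counts, context, nucleotide, alpha=1.0):
    """Add-alpha smoothed estimate of P(nucleotide | context)."""
    ctx = counts.get(context, Counter())
    total = sum(ctx.values()) + alpha * len(ALPHABET)
    return (ctx[nucleotide] + alpha) / total


def perplexity(counts, seq, n=3):
    """exp of the mean negative log-likelihood per predicted nucleotide."""
    log_prob, steps = 0.0, 0
    for i in range(n - 1, len(seq)):
        log_prob += math.log(prob(counts, seq[i - n + 1:i], seq[i]))
        steps += 1
    return math.exp(-log_prob / steps)


def generate(counts, seed, length, n=3, rng=None):
    """Sample autoregressively; seed must contain at least n-1 nucleotides."""
    rng = rng or random.Random(0)
    out = seed
    while len(out) < length:
        context = out[len(out) - (n - 1):]
        weights = [prob(counts, context, c) for c in ALPHABET]
        out += rng.choices(ALPHABET, weights=weights)[0]
    return out


if __name__ == "__main__":
    toy_data = ["ACGT" * 50]  # made-up sequence, not real gene data
    model = train_ngram(toy_data, n=3)
    print(perplexity(model, "ACGTACGT", n=3))
    print(generate(model, "AC", 20, n=3))
```

With a vocabulary of four, a model that has learned nothing assigns probability 1/4 to every nucleotide and therefore has a perplexity of exactly 4, so any value below 4 on held-out sequences indicates that some structure was captured. As the abstract notes, a low perplexity alone does not show that generated sequences are useful on real-life tasks.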