Generative Language Models on Nucleotide Sequences of Human Genes

Language models, primarily transformer-based ones, have achieved great success in NLP. In particular, BERT for natural language understanding (NLU) and GPT-3 for natural language generation (NLG) have been highly influential. DNA sequences are structurally similar to natural language, and discriminative models such as DNABert already exist in DNA-related bioinformatics. The generative side, however, remains largely unexplored to the best of our knowledge. We therefore focused on developing an autoregressive generative language model, like GPT-3, for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we carried out our study on a smaller scale, focusing on nucleotide sequences of human genes, which are distinct parts of DNA with specific functions, rather than on whole DNA. This decision does not change the structure of the problem much, because both DNA and genes can be treated as 1D sequences of four different nucleotides without losing much information or oversimplifying. We systematically examined this almost entirely unexplored problem and observed that RNNs performed best, while simple techniques such as N-grams were also promising. A further benefit was learning how to work with generative models on languages that, unlike natural language, we do not understand. We also observed how essential it is to evaluate on real-life tasks beyond classical metrics such as perplexity. Finally, we examined whether the data-hungry nature of these models changes when the language has a minimal vocabulary, here of size four, one per nucleotide type. The motivation was that such a language might make the problem easier. In this study, however, it did not substantially change the amount of data needed.
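To make the N-gram baseline and the perplexity metric mentioned in the abstract concrete, the following is a minimal sketch of a nucleotide-level N-gram model with autoregressive sampling. It is not the authors' implementation: the function names (`train_ngram`, `prob`, `perplexity`, `generate`), the `^` start-padding symbol, the add-alpha smoothing, and the toy sequence are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # the four-nucleotide vocabulary


def train_ngram(sequences, n=3):
    """Count how often each nucleotide follows each (n-1)-length context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = "^" * (n - 1) + seq  # '^' is a hypothetical start-padding symbol
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Add-alpha smoothed P(symbol | context) over the four nucleotides."""
    ctx = counts.get(context, Counter())
    total = sum(ctx.values())
    return (ctx[symbol] + alpha) / (total + alpha * len(ALPHABET))


def perplexity(counts, sequences, n=3, alpha=1.0):
    """exp of the average negative log-likelihood per nucleotide."""
    log_sum, num = 0.0, 0
    for seq in sequences:
        padded = "^" * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            log_sum += math.log(prob(counts, padded[i - n + 1:i], padded[i], alpha))
            num += 1
    return math.exp(-log_sum / num)


def generate(counts, length, n=3, seed=0):
    """Sample nucleotides one at a time, each conditioned on the previous n-1."""
    rng = random.Random(seed)
    out = "^" * (n - 1)
    for _ in range(length):
        context = out[len(out) - (n - 1):]
        weights = [prob(counts, context, s) for s in ALPHABET]
        out += rng.choices(ALPHABET, weights=weights)[0]
    return out[n - 1:]


if __name__ == "__main__":
    toy = ["ACGTACGTACGT"]  # toy data, not real gene sequences
    model = train_ngram(toy, n=3)
    print(perplexity(model, toy, n=3))
    print(generate(model, 20, n=3))
```

With a four-symbol vocabulary, a model that has learned nothing assigns probability 1/4 to every nucleotide and therefore has perplexity 4, which gives a natural reference point when reading perplexity values for such models. As the abstract notes, perplexity alone is not sufficient, so a sketch like this would still need to be paired with a downstream task for a meaningful evaluation.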

