The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen considerable success in downstream tasks within each domain and have achieved particularly noteworthy breakthroughs in sequence and structural modeling. However, single-omic models are inherently unable to model multi-omic tasks efficiently, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations across different single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite being trained only on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any \textit{a priori} structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omic distributions, both in performance-per-FLOP and in absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
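For concreteness, the following is a minimal sketch of the kind of fine-tuning objective described above, not the authors' implementation: it assumes the multi-omic transformer yields a pooled embedding for a concatenated protein and nucleic-acid sequence, and attaches a small regression head trained with mean-squared error against experimental $\Delta G$ labels. The class name \texttt{DeltaGHead}, the embedding dimension, and the pooling scheme are illustrative placeholders.

\begin{verbatim}
# Minimal sketch (not the paper's code): a regression head on top of a
# pooled multi-omic sequence embedding, predicting the Gibbs free energy
# change (Delta G) of a protein-nucleic acid binding interaction.
import torch
import torch.nn as nn

class DeltaGHead(nn.Module):
    """Maps a pooled joint-sequence embedding to a scalar Delta G."""
    def __init__(self, embed_dim: int = 512):  # embed_dim is a placeholder
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 2),
            nn.GELU(),
            nn.Linear(embed_dim // 2, 1),
        )

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # pooled_embedding: (batch, embed_dim), e.g. mean-pooled over the
        # concatenated protein + nucleic-acid token representations.
        return self.mlp(pooled_embedding).squeeze(-1)

if __name__ == "__main__":
    batch, embed_dim = 4, 512
    pooled = torch.randn(batch, embed_dim)   # stand-in for the encoder output
    target_dg = torch.randn(batch)           # experimental Delta G labels (kcal/mol)
    head = DeltaGHead(embed_dim)
    loss = nn.functional.mse_loss(head(pooled), target_dg)
    loss.backward()                          # gradients flow into the head
\end{verbatim}

In practice one would either backpropagate through the pretrained encoder as well or keep it frozen; the sketch only illustrates the shape of the regression objective.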