We report a flexible, language-model-based deep learning strategy for solving complex forward and inverse problems in protein modeling. The approach rests on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism to realize a generative pretrained model. The model is applied to predict secondary structure content (both at the per-residue level and as overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model becomes capable of designing proteins with these properties as target features. The model is formulated as a general framework, is completely prompt-based, and can be adapted to a variety of downstream tasks. We find that adding tasks yields emergent synergies that the model exploits to improve overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs focused specifically on structural proteins while also exploring applicability to the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform eight distinct tasks, it can be extended to solve additional problems given suitable datasets. In a broader sense, this work illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level UTF-8 characters that define the nature of the physical system at hand) to complex output. This materiomic scheme captures complex emergent relationships between universal building blocks and resulting properties via a synergizing learning capacity that expresses a set of potentialities embedded in the knowledge used in training, through the interplay of universality and diversity.