Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX

Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on a limited set of specialized tasks, often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have shown potential for any-to-any generation of content such as text, images, and videos, enriching user interactions across various domains. Integrating these multimodal technologies into protein research holds significant promise and could transform how proteins are studied. To this end, we introduce HelixProtX, a system built upon a large multimodal model that aims to offer a comprehensive solution for protein research by supporting any-to-any protein modality generation. Unlike existing methods, it can transform any input protein modality into any desired protein modality. Experimental results affirm the capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating large multimodal models into protein research, HelixProtX opens new avenues for understanding protein biology and promises to accelerate scientific discovery.
