Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry, and biology. This approach leverages the rich, multifaceted descriptions of biomolecules found in textual data sources to deepen our fundamental understanding and to enable downstream computational tasks such as biomolecule property prediction. Fusing the nuanced narratives expressed in natural language with the structural and functional specifics of biomolecules captured by various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view that encompasses both the symbolic qualities conveyed through language and the quantitative structural characteristics. In this review, we provide an extensive analysis of recent advances in the cross-modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions that merit further exploration and investment. Related resources and content are continuously updated at \url{https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling}.
