Recent advances in biological research integrate molecules, proteins, and natural language to enhance drug discovery. However, current models have several limitations: they generate invalid molecular SMILES, underutilize contextual information, and treat structured and unstructured knowledge equally. To address these issues, we propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. $\mathbf{BioT5}$ uses SELFIES for $100\%$ robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, $\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective use of information. After fine-tuning, $\mathbf{BioT5}$ shows superior performance across a wide range of tasks, demonstrating its strong capability to capture the underlying relations and properties of bio-entities. Our code is available at $\href{https://github.com/QizhiPei/BioT5}{GitHub}$.