Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. $\mathbf{BioT5}$ utilizes SELFIES for $100\%$ robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, $\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability in capturing the underlying relations and properties of bio-entities. Our code is available at \href{https://github.com/QizhiPei/BioT5}{GitHub}.