Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency in specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions comprises three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve LLMs' understanding and prediction of biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate that Mol-Instructions improves large models' performance on biomolecular tasks, thereby supporting progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will be updated regularly to enhance its applicability.