Large Language Models (LLMs), with their strong task-handling capabilities and innovative outputs, have driven significant advances across many fields. However, their proficiency in specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a carefully curated, comprehensive instruction dataset designed specifically for the biomolecular domain. Mol-Instructions consists of three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component is curated to improve how well LLMs understand and predict biomolecular features and behaviors. Through extensive instruction tuning experiments on a representative LLM, we demonstrate that Mol-Instructions enhances the adaptability and understanding of large models in biomolecular studies, thereby supporting progress in the biomolecular research community. Mol-Instructions is publicly available for future research and will be updated continually to broaden its applicability.