Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency in specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions comprises three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the ability of LLMs to understand and predict biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing LLM performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.