Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.