We propose that small pretrained foundational generative language models with millions of parameters can be used as a general learning framework for sequence-based tasks. Our proposal overcomes the computational resource, skill set, and timeline challenges associated with training neural networks and language models from scratch. Further, our approach focuses on creating small, highly specialized models that can accurately execute a challenging task that the base model is incapable of performing. We demonstrate that 125M, 350M, and 1.3B parameter pretrained foundational language models can be instruction fine-tuned with 10,000 to 1,000,000 instruction examples to achieve near state-of-the-art results on challenging cheminformatics tasks. We also demonstrate that successive fine-tuning epochs improve outcomes, and that both data formatting and the choice of pretrained foundational language model are important to the success of instruction fine-tuning.