Instruction finetuning is a popular paradigm for aligning large language models (LLMs) with human intent. Despite its popularity, the idea remains underexplored as a way to align existing foundation models with scientific disciplines, concepts, and goals. In this work, we present SciTune, a tuning framework that improves the ability of LLMs to follow scientific multimodal instructions. To test our methodology, we use a human-generated scientific instruction tuning dataset to train LLaMA-SciTune, a large multimodal model that connects a vision encoder and an LLM for science-focused visual and language understanding. Compared with models finetuned on machine-generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories of the ScienceQA benchmark.