In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Recent claims that open models can be on par with state-of-the-art proprietary models are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca), and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best-performing instruction-tuned model suite, finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model- and human-preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systematic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 83% of ChatGPT performance and 68% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.