How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

From arXiv. 18 pages, 5 figures, 7 tables. Under review at the NeurIPS 2023 Datasets and Benchmarks Track.

In this work, we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca), and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best-performing instruction-tuned model suite, finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model and human preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systemic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 83% of ChatGPT performance and 68% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.

