We study the effectiveness of a simple approach to developing a small base language model (LM) from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train the smaller model on a very small subset (0.1%) of the larger model's raw pretraining data. We call this recipe Inheritune and first demonstrate it by building a small base LM with 1.5B parameters using 1B tokens (and the first few layers of a larger LM with 3B parameters); this takes less than half a day on a single A6000 GPU. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which were trained on 50-1000 times more tokens. We also investigate Inheritune in a slightly different setting, where we train small LMs using larger LMs and their full pretraining dataset. Here we show that smaller LMs trained using some of the layers of GPT2-medium (355M) and GPT2-large (770M) can effectively match the validation loss of their bigger counterparts trained from scratch for the same number of training steps on the OpenWebText dataset with 9B tokens. We analyze our recipe with extensive experiments and demonstrate its efficacy in diverse settings. Our code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
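The inheritance step of the recipe (keep the first few transformer blocks of the larger model, then continue training the smaller one) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the repository linked above. The function name `inherit_first_blocks`, the `transformer.h.<index>.` key pattern (modeled on the parameter naming of common GPT-2 checkpoints), and the choice to carry over every non-block parameter (embeddings, final norm, output head) are all assumptions made for illustration. Plain strings stand in for weight tensors so the example runs without any deep learning library.

```python
import re
from typing import Any, Dict


def inherit_first_blocks(
    state_dict: Dict[str, Any],
    n_keep: int,
    block_pattern: str = r"^transformer\.h\.(\d+)\.",  # assumed key layout
) -> Dict[str, Any]:
    """Return a new mapping that keeps only the first `n_keep` blocks.

    Parameters whose names do not match `block_pattern` (for example
    embeddings or the final layer norm) are copied over unchanged; this
    is an assumption of the sketch, not a detail stated in the abstract.
    """
    pattern = re.compile(block_pattern)
    small: Dict[str, Any] = {}
    for name, weight in state_dict.items():
        match = pattern.match(name)
        # Keep non-block parameters, and blocks with index below n_keep.
        if match is None or int(match.group(1)) < n_keep:
            small[name] = weight
    return small


# Toy "large model" with four blocks; values are placeholders for tensors.
toy_large = {
    "transformer.wte.weight": "embeddings",
    "transformer.h.0.attn.c_attn.weight": "block0",
    "transformer.h.1.attn.c_attn.weight": "block1",
    "transformer.h.2.attn.c_attn.weight": "block2",
    "transformer.h.3.attn.c_attn.weight": "block3",
    "transformer.ln_f.weight": "final_norm",
    "lm_head.weight": "head",
}

# Inherit the first two blocks to initialize the smaller model.
toy_small = inherit_first_blocks(toy_large, n_keep=2)
for name in toy_small:
    print(name)
```

In a real setting the filtered mapping would be loaded into a model configured with `n_keep` layers, and that model would then be trained on the small data subset; both of those steps are outside this sketch.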