Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprising supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that transforms the knowledge in internet-scale pre-training documents into billions of synthetic instruction-and-answer training pairs. The resulting dataset, called FineInstructions, is built from ~18M instruction templates derived from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far closer to the distribution of expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find that pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions.
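The template-instantiation step described above can be sketched minimally: an instruction template containing a document slot is filled with a matched pre-training document to yield a synthetic training pair. This is an illustrative assumption of the pairing interface, not the paper's actual pipeline; the function and field names are hypothetical, and in practice the answer would be generated by conditioning a model on the source document rather than left as a placeholder.

```python
def instantiate(template: str, document: str) -> dict:
    """Hypothetical sketch: fill a template's {document} slot to form a
    synthetic instruction, paired with its human-written source document.
    The actual FineInstructions pipeline matches templates to documents
    and produces a grounded answer; here the answer field is omitted."""
    return {
        "instruction": template.format(document=document),
        "source_document": document,
    }

# Illustrative usage with a made-up template and document.
template = "Summarize the key claims of the following text:\n\n{document}"
doc = "LLMs are typically pre-trained on unstructured text via next-word prediction."
pair = instantiate(template, doc)
print(pair["instruction"])
```

At internet scale, this kind of instantiation is what lets a fixed pool of ~18M templates expand into billions of distinct training pairs, since each template can be paired with many matched documents.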