Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets are expensive and time-consuming to curate manually, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or the terms of service of LLM providers. We therefore seek a way to construct instruction datasets whose samples are generated by neither humans nor LLMs, yet still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template (a data-generating Python function) to produce training data that encourages the model to learn an explicit pattern-based rule corresponding to a desired task. We find that fine-tuning on Cookbook-generated data improves performance on the corresponding task by up to 52.7 accuracy points. Second, since instruction datasets improve performance on multiple downstream tasks simultaneously, Cookbook algorithmically learns how to mix data from various templates to optimize performance on multiple tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned on a Cookbook-generated dataset attains the highest average accuracy among 7B-parameter instruction-tuned models and is the best-performing model on 3 out of 8 tasks. Finally, we analyze when and why Cookbook improves performance and present a metric that allows us to verify that the improvement is largely explained by the model's generations adhering better to template rules.
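
To make the notion of a template concrete, the following is a minimal sketch of what a data-generating Python function over random tokens could look like. It is an illustration only and not one of the templates from this work: the rule (copy the tokens that follow a query token in the context), the names `make_example`, `generate_dataset`, and `SEP`, and all default sizes are assumptions chosen for the example.

```python
import random

# Hypothetical separator ID, chosen outside the random-token vocabulary.
SEP = -1


def make_example(rng, vocab_size=1000, context_len=12, span_len=3):
    """Illustrative template: the completion copies the span that follows
    the query token in the context. Tokens are random integer IDs, so the
    only thing to learn is the pattern-based rule itself."""
    # Sample without replacement so the query token occurs exactly once.
    context = rng.sample(range(vocab_size), context_len)
    # Pick a query position that leaves room for span_len following tokens.
    start = rng.randrange(context_len - span_len)
    query = context[start]
    completion = context[start + 1 : start + 1 + span_len]
    return {"prompt": context + [SEP, query], "completion": completion}


def generate_dataset(n, seed=0):
    """Generate n examples from the template with a fixed seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for example in generate_dataset(2):
        print(example)
```

Because every sample comes from a seeded random number generator, a dataset of any size can be produced at negligible cost and without human or LLM involvement. The token IDs would still need to be mapped into a model's vocabulary before fine-tuning, and that step is not shown here.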
