We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on the BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat achieves near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning alone, which is almost impossible with previous pretrained language models such as Bloom, OPT, and GPT-NeoX. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks such as large-number multiplication and division, we propose an approach that classifies tasks by their learnability and then decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by applying basic arithmetic principles. We thoroughly examine the performance of our model and provide a comprehensive evaluation of the effectiveness of the proposed decomposition steps. Additionally, Goat-7B can be trained easily using LoRA on a single GPU with 24GB of VRAM, which facilitates reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.
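To make the decomposition idea concrete, the following is a minimal sketch of how a multi-digit multiplication might be broken into simpler steps. It assumes the "basic arithmetic principle" in question is the distributive law applied to a place-value split of the smaller operand; the abstract does not specify the exact procedure or output format, so the function name `decompose_multiplication` and the step layout are hypothetical and are not taken from the released dataset script.

```python
from __future__ import annotations


def decompose_multiplication(a: int, b: int) -> list[str]:
    """Return illustrative intermediate steps for a * b (non-negative ints).

    Assumption: each step only needs (1) multiplying by a single nonzero
    digit followed by zeros, or (2) adding two numbers, which this sketch
    treats as the directly learnable sub-tasks.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")

    # Keep the larger operand whole and split the smaller one by place value.
    big, small = (a, b) if a >= b else (b, a)
    digits = str(small)
    parts = [
        int(d) * 10 ** (len(digits) - i - 1)
        for i, d in enumerate(digits)
        if d != "0"
    ]

    # Zero or one nonzero digit: nothing to decompose in this sketch.
    if len(parts) <= 1:
        return [f"{a} * {b} = {a * b}"]

    # Step 1: rewrite the smaller operand as a sum; step 2: distribute.
    steps = [
        f"{a} * {b} = {big} * ({' + '.join(map(str, parts))})",
        " + ".join(f"{big} * {p}" for p in parts),
    ]

    # Step 3: evaluate each partial product.
    products = [big * p for p in parts]
    steps.append(" + ".join(map(str, products)))

    # Step 4: fold the partial products together one addition at a time.
    total = products[0]
    for i, p in enumerate(products[1:], start=1):
        total += p
        steps.append(" + ".join(map(str, [total] + products[i + 1:])))
    return steps


if __name__ == "__main__":
    for line in decompose_multiplication(397, 4429):
        print(line)
```

Each emitted line is a single small transformation, so a model trained on such sequences would only need to carry out one simple operation per step rather than produce the full product in one shot. Division would need a different scheme, which this sketch does not attempt.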