Large Language Models (LLMs) have shown excellent performance in language understanding, text generation, code synthesis, and many other tasks, yet they still struggle with complex multi-step reasoning problems such as mathematical reasoning. In this paper, through a newly proposed arithmetical puzzle problem, we show that a model can perform well on multi-step reasoning tasks via fine-tuning on high-quality synthetic data. Experimental results with the open-llama-3B model on three different test datasets show that the model not only reaches a zero-shot pass@1 of 0.44 on the in-domain dataset, but also demonstrates certain generalization capabilities on out-of-domain datasets. Specifically, we design two out-of-domain datasets by separately extending the numerical range and the composing components of the arithmetical puzzle problem. The fine-tuned models show encouraging performance on these two far more difficult tasks, with zero-shot pass@1 of 0.33 and 0.35, respectively.