Despite outstanding performance on many tasks, language models are notoriously prone to factual errors when a task requires arithmetic computation. We address this deficiency by creating Calc-X, a collection of datasets that demonstrates the appropriate use of a calculator in reasoning chains. Calc-X is suitable for teaching language models to offload computations to a symbolic system. We survey several existing chain-of-thought datasets and unify them into a proposed format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X collection to train open-source calculator-using models we call Calcformers and show that these models approximately double the accuracy of generating correct results compared to vanilla language model baselines. We make all Calc-X datasets, source code, and Calcformer models publicly available.
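To make the idea of offloading computation to a symbolic system concrete, the following is a minimal sketch, not the Calc-X format or the Calcformers implementation. It assumes a hypothetical markup in which the model writes `<calc>EXPR</calc>` inside its reasoning chain and a host program appends `<out>VALUE</out>`; the tag names and the helpers `calculate` and `fill_calculator_outputs` are illustrative assumptions that do not come from the abstract.

```python
import ast
import operator
import re

# Supported operators; any other syntax is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval").body)


# Hypothetical markup (an assumption for this sketch): the model emits
# <calc>EXPR</calc>; the host appends <out>VALUE</out> right after it.
CALL = re.compile(r"<calc>(.*?)</calc>")


def fill_calculator_outputs(chain):
    """Append the computed value after every calculator call in a chain."""

    def replace(match):
        value = calculate(match.group(1))
        # Show whole-number floats such as 4.0 as 4.
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        return f"{match.group(0)}<out>{value}</out>"

    return CALL.sub(replace, chain)


chain = "Each box holds 12 eggs, so 7 boxes hold <calc>12 * 7</calc> eggs."
print(fill_calculator_outputs(chain))
# prints: Each box holds 12 eggs, so 7 boxes hold <calc>12 * 7</calc><out>84</out> eggs.
```

In this sketch the arithmetic result comes from the evaluator rather than from generated text, which is the kind of offloading the abstract describes. A real system would also need to interleave such calls with generation, a detail this example leaves out.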