Large language models (LLMs) can solve an increasing number of complex reasoning tasks, yet they make surprising mistakes in basic numerical understanding and processing (e.g., judging that 9.11 > 9.9). This ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work has paid little attention to it or has discussed only a few restricted tasks (such as integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. First, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompass nearly all everyday numerical understanding and processing scenarios, and have simple, clear rules. Using this benchmark, we find that current LLMs frequently fail on many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, positional encodings, and number formats), comprehensively evaluating their effectiveness on our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can substantially improve NUPA on many, but not all, tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective when finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.