Large language models (LLMs) can solve an increasing number of complex reasoning tasks, yet they make surprising mistakes in basic numerical understanding and processing (such as concluding that 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work has paid little attention to it or has discussed only a few restricted tasks (such as integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. First, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompass nearly all everyday numerical understanding and processing scenarios, and have simple, clearly defined rules. Using the benchmark, we find that current LLMs fail frequently on many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as special tokenizers, positional encodings, and number formats), comprehensively evaluating their effectiveness on our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning substantially improves NUPA on many, but not all, tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective when finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work takes a preliminary step towards understanding and improving the NUPA of LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.
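To make the motivating failure concrete: the 9.11 > 9.9 mistake arises from comparing the fractional digits as integers (11 > 9) instead of comparing the numbers' values. The sketch below is purely illustrative (it is not the benchmark's actual task generator; `make_comparison_item` is a hypothetical helper), showing how a decimal-comparison item with an exact ground-truth answer could be generated using Python's `decimal` module.

```python
import random
from decimal import Decimal

def make_comparison_item(seed: int = 0):
    """Generate one decimal-comparison question with an exact answer.

    Illustrative only: a hypothetical generator in the spirit of the
    9.11 vs 9.9 example, not the NUPA benchmark's real code.
    """
    rng = random.Random(seed)
    a = Decimal(f"{rng.randint(1, 99)}.{rng.randint(1, 99)}")
    b = Decimal(f"{rng.randint(1, 99)}.{rng.randint(1, 99)}")
    question = f"Which is larger, {a} or {b}?"
    answer = str(max(a, b))  # exact decimal comparison, no float rounding
    return question, answer

# The failure mode in miniature: digit-wise intuition says 11 > 9,
# but exact decimal comparison gives the correct ordering.
assert int("11") > int("9")                # the naive fractional-part trap
assert Decimal("9.9") > Decimal("9.11")    # the numerically correct answer
```

Using `Decimal` rather than `float` keeps the ground truth exact for inputs like 9.1 that have no finite binary representation.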