Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been paid to how these models perform when numerical expressions deviate from the prevailing conventions of their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, even though the underlying mathematical reasoning is identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for using LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
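The abstract mentions explicit numeral mapping as one mitigation. The paper's own implementation is not shown here; the following is a minimal sketch, assuming the mapping simply normalizes any Unicode decimal digit (e.g. Arabic-Indic or Devanagari) to its ASCII equivalent before the text is passed to the model. The function name `normalize_digits` is illustrative, not from the paper.

```python
import unicodedata

def normalize_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent,
    leaving all other characters unchanged. A hypothetical preprocessing
    step for the 'explicit numeral mapping' strategy."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit's numeric value (0-9)
        # for any character in Unicode category Nd, else the default.
        d = unicodedata.decimal(ch, None)
        out.append(str(d) if d is not None else ch)
    return "".join(out)

print(normalize_digits("٤٢ + ٨"))   # Arabic-Indic digits → "42 + 8"
print(normalize_digits("४२ + ८"))   # Devanagari digits → "42 + 8"
```

Such a normalization could be applied to prompts before inference, or provided in-context as an explicit digit-correspondence table when the goal is to test the model's own mapping ability rather than to bypass it.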