Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to resolve this conflict by exploring how language models manipulate numbers and by quantifying lower bounds on the accuracy of these mechanisms. We find that despite surface errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate, and universal across their hidden states and across types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a foundation for understanding how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques for targeted refinements of LLM architectures.
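To make the probing idea concrete, the following is a minimal sketch of a linear probe that decodes a number from a model's hidden state. The hidden states here are simulated (a random direction scaled by the number plus noise), and the least-squares probe is an illustrative assumption, not the paper's actual probe architecture or data.

```python
import numpy as np

# Simulate hidden states for the numbers 0..99 as noisy linear projections
# of their value (an assumption for illustration; real hidden states come
# from an LLM's layers).
rng = np.random.default_rng(0)
d_model = 64
numbers = np.arange(100, dtype=float)

direction = rng.normal(size=d_model)
hidden = numbers[:, None] * direction[None, :] + 0.1 * rng.normal(size=(100, d_model))

# Fit a least-squares linear probe on even numbers, evaluate on held-out odds.
train, test = numbers % 2 == 0, numbers % 2 == 1
w, *_ = np.linalg.lstsq(hidden[train], numbers[train], rcond=None)
pred = hidden[test] @ w
mae = np.abs(pred - numbers[test]).mean()
print(f"held-out mean absolute error: {mae:.3f}")
```

If number representations are as systematic and linear as the abstract claims, such a simple probe generalizes to held-out numbers with low error; applying the same probe across layers is one way to trace where numeric information degrades.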