This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundation model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding of higher-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. Based on the designed probing tasks, we report several key and uncommon findings: (1) Horizontally, enlarging the model barely imparts additional knowledge or computational prowess by itself. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical thinking, multilingual, and recognition abilities, while the top layers house most of the computational power and real-world knowledge.