This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundation model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding of higher-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. The probing tasks reveal several key and unusual findings: (1) Horizontally, increasing model size does little, on its own, to impart additional knowledge or computational ability. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical thinking, multilingual, and recognition abilities, while the top layers hold most of the computational power and real-world knowledge.
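As a rough illustration of what multiple-choice probing can look like, the sketch below picks the option that a scoring function rates highest and reports accuracy over a set of items. The abstract does not specify how options are scored, so everything here is an assumption: `pick_option`, `probe_accuracy`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a model-derived score (for example, a log-likelihood computed from one layer of one model size).

```python
from typing import Callable, Iterable, Sequence, Tuple

# A scorer maps (question, option) to a number; higher means "more preferred".
# In a real probe this would come from the model (assumption, not from the paper).
Scorer = Callable[[str, str], float]
Item = Tuple[str, Sequence[str], int]  # (question, options, index of gold option)


def pick_option(score: Scorer, question: str, options: Sequence[str]) -> int:
    """Return the index of the highest-scoring option (first one on ties)."""
    scores = [score(question, opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def probe_accuracy(score: Scorer, items: Iterable[Item]) -> float:
    """Fraction of items where the top-scoring option is the gold option."""
    items = list(items)
    if not items:
        return 0.0
    correct = sum(pick_option(score, q, opts) == gold for q, opts, gold in items)
    return correct / len(items)


def toy_score(question: str, option: str) -> float:
    """Hypothetical stand-in scorer for questions shaped like '2 + 3 ='.

    It rates an option by how close it is to the true sum, so it behaves
    like a perfect arithmetic "model" for this toy data.
    """
    a, _, b = question.split()[:3]
    return -abs(int(option) - (int(a) + int(b)))


ITEMS = [
    ("2 + 3 =", ["4", "5", "6"], 1),
    ("7 + 1 =", ["8", "9", "6"], 0),
]

if __name__ == "__main__":
    print(probe_accuracy(toy_score, ITEMS))
```

Under these assumptions, a horizontal comparison would call `probe_accuracy` once per model size with that model's scorer, and a vertical comparison would call it once per layer with a scorer built from that layer's representations.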