This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in higher-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. Based on the designed probing tasks, we unveil several key and uncommon findings: (1) Horizontally, simply enlarging model size does not automatically impart additional knowledge or computational prowess. Instead, it can enhance reasoning ability, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical thinking, multilingual, and recognition abilities, while the top layers house most of the computational power and real-world knowledge.