Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics that involve no generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers, though this dependence can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.