Large language models (LLMs) have shown remarkable capability in natural language tasks, yet debate persists on whether they truly comprehend deep structure (i.e., core semantics) or merely rely on surface structure (e.g., presentation format). Prior studies observe that LLMs' performance declines under interventions on surface structure, and argue that their success therefore rests on surface structure recognition. However, sensitivity to surface structure does not preclude deep structure comprehension. Rigorously evaluating LLMs' capability requires analyzing both, yet deep structure is often overlooked. To this end, we assess LLMs' comprehension ability using causal mediation analysis, aiming to fully characterize their use of both deep and surface structures. Specifically, we formulate the comprehension of deep structure as a direct causal effect (DCE) and that of surface structure as an indirect causal effect (ICE). To address the non-estimability of the original DCE and ICE, which stems from the infeasibility of isolating the mutual influences of deep and surface structures, we develop quantifiable surrogates: the approximated DCE (ADCE) and the approximated ICE (AICE). We further apply the ADCE to evaluate a series of mainstream LLMs, showing that most of them exhibit deep structure comprehension ability, which grows with prediction accuracy. Comparing ADCE and AICE demonstrates that closed-source LLMs rely more on deep structure, while open-source LLMs are more surface-sensitive, a sensitivity that decreases with model scale. Theoretically, ADCE is a bidirectional evaluation: it measures both the sufficiency and the necessity of deep structure changes in causing output variations, thus offering a more comprehensive assessment than accuracy, the common evaluation metric for LLMs. Our work provides new insights into LLMs' deep structure comprehension and offers novel methods for LLM evaluation.
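The DCE and ICE named above build on standard quantities from causal mediation analysis. As a hedged sketch in Pearl's natural-effect notation (the paper's exact ADCE/AICE surrogates are not reproduced here), one common formalization takes the deep structure $X$ as the treatment, the surface structure $M$ as a mediator, and the model output $Y$ as the outcome:

```latex
% Standard causal mediation decomposition (Pearl's notation);
% X = deep structure, M = surface structure (mediator), Y = model output.
% Total effect of changing X from x_0 to x_1:
\mathrm{TE} = \mathbb{E}[Y_{x_1}] - \mathbb{E}[Y_{x_0}]
% Natural direct effect: vary X while M is held at its value under x_0
\mathrm{NDE} = \mathbb{E}\bigl[Y_{x_1,\, M_{x_0}}\bigr] - \mathbb{E}[Y_{x_0}]
% Natural indirect effect: hold X at x_0 while M responds as if X = x_1
\mathrm{NIE} = \mathbb{E}\bigl[Y_{x_0,\, M_{x_1}}\bigr] - \mathbb{E}[Y_{x_0}]
```

The non-estimability the abstract points to arises because the counterfactual terms $Y_{x_1, M_{x_0}}$ and $Y_{x_0, M_{x_1}}$ require fixing the mediator while varying the treatment, which is infeasible when deep and surface structures cannot be intervened on independently; ADCE and AICE are the quantifiable surrogates introduced for this reason.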