We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking whether different ID profiles across LLM layers differentially characterize formal and functional complexity. We find that the formal contrast between sentences with multiple coordinated or subordinated clauses is reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding, or by unambiguous vs. ambiguous relative clause attachment, are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it makes it possible to differentiate between types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.