Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on the inner mechanisms of LLMs through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protections by controlling the intrinsic dimension of the embeddings through informed prompt manipulation. Second, we derive interpretable geometric features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of its inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions about LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
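To make point $(ii)$ concrete, the following is a minimal sketch (not the paper's implementation) of the standard piecewise-affine view of a ReLU MLP: the ReLU activation pattern of an input identifies which region of the partition it falls in, and within that region the network acts as a single affine map whose slope and offset can be written in closed form. The weights and dimensions below are random placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy one-hidden-layer MLP with placeholder random weights
W1 = rng.standard_normal((8, 4)); b1 = rng.standard_normal(8)
W2 = rng.standard_normal((3, 8)); b2 = rng.standard_normal(3)

def mlp(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_affine(x):
    # the binary activation pattern identifies the partition region of x
    q = (W1 @ x + b1 > 0).astype(float)
    A = W2 @ (q[:, None] * W1)   # per-region slope matrix
    c = W2 @ (q * b1) + b2       # per-region offset
    return A, c

x = rng.standard_normal(4)
A, c = region_affine(x)
# within its region, the MLP coincides exactly with the affine map A x + c
assert np.allclose(mlp(x), A @ x + c)
```

Any input sharing the same activation pattern `q` is mapped by the same `(A, c)`, which is what makes the partition and per-region mappings amenable to exact analysis.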