Interpretability is a key challenge in fostering trust in Large Language Models (LLMs), stemming from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) for interpreting and controlling LLMs by modeling multi-token words. Prior research has explored the LRH to connect LLM representations with linguistic concepts, but has been limited to single-token analysis. Since most words are composed of several tokens, we extend the LRH to multi-token words, enabling its use on any textual data and spanning thousands of concepts. To this end, we propose that words can be interpreted as frames: ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which intuitively steers text generation using concepts of choice. We validate these ideas on the Llama 3.1, Gemma 2, and Phi 3 model families, demonstrating gender and language biases, exposing harmful content, and also showing the potential to remediate them, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
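As an illustrative sketch of the abstract's core constructions (a word as a frame of ordered token vectors, a concept as the average of word frames, and concept-guided selection among candidate words), the toy example below uses a random stand-in embedding table. All names, shapes, the toy vocabulary, and the cosine-based scoring function are illustrative assumptions for same-length word frames, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token embedding table (vocab_size x d); stands in for an
# LLM's (un)embedding matrix. Vocabulary and dimension are illustrative.
vocab = {"qu": 0, "een": 1, "ki": 2, "ng": 3, "wo": 4, "man": 5}
E = rng.normal(size=(len(vocab), 4))

def word_frame(tokens):
    """A word's frame: the ordered sequence of its token vectors (k x d)."""
    return np.stack([E[vocab[t]] for t in tokens])

def concept_frame(words):
    """A concept: the element-wise average of (same-length) word frames."""
    return np.mean([word_frame(w) for w in words], axis=0)

def frame_score(frame, concept):
    """Toy alignment of a word frame with a concept: mean per-position cosine."""
    num = np.sum(frame * concept, axis=1)
    den = np.linalg.norm(frame, axis=1) * np.linalg.norm(concept, axis=1)
    return float(np.mean(num / den))

# A concept built from two multi-token words assumed to share it.
c = concept_frame([["qu", "een"], ["wo", "man"]])

# Concept-guided choice among candidate next words: pick the candidate
# whose frame aligns best with the steering concept.
candidates = [["qu", "een"], ["ki", "ng"]]
best = max(candidates, key=lambda w: frame_score(word_frame(w), c))
```

In this sketch a frame keeps one vector per token rather than collapsing a word to a single point, which is what lets multi-token words participate in concept construction and steering.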