Sentiment is a pervasive feature of natural language text, yet how it is represented within Large Language Models (LLMs) remains an open question. In this study, we show that, across a range of models, sentiment is represented linearly: a single direction in activation space largely captures the feature across a range of tasks, with one extreme corresponding to positive sentiment and the other to negative. Through causal interventions, we isolate this direction and show that it is causally relevant both in toy tasks and in real-world datasets such as the Stanford Sentiment Treebank. This case study serves as a model for a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon that we term the summarization motif: sentiment is not represented solely on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. In zero-shot classification on the Stanford Sentiment Treebank, ablating the sentiment direction removes 76% of above-chance classification accuracy, and nearly half of this loss (36% of above-chance accuracy) comes from ablating the summarized sentiment direction at comma positions alone.
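To make the notions of a "sentiment direction", "ablating" it at selected positions, and "above-chance accuracy lost" concrete, the following is a minimal sketch on synthetic data. It is illustrative only and rests on several assumptions not stated in the abstract: the direction is estimated as a difference of class means, ablation is done by projecting the direction out (zero-ablation), and chance accuracy is 0.5 for a binary task. The function names and the toy activations are hypothetical, and the study's actual procedure may differ.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector from the negative-class mean to the positive-class mean.

    Assumption: a difference of means is one simple way to estimate a
    linear feature direction; it is not claimed to be the study's method.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def ablate_direction(acts, direction, positions=None):
    """Project `direction` out of `acts` at the given token positions.

    acts: array of shape (seq_len, d_model).
    positions: indices to ablate (e.g. comma positions); all if None.
    """
    out = acts.copy()
    idx = np.arange(acts.shape[0]) if positions is None else np.asarray(positions)
    coeffs = out[idx] @ direction              # component along the direction
    out[idx] -= np.outer(coeffs, direction)    # remove that component
    return out


def above_chance_accuracy_lost(clean_acc, ablated_acc, chance=0.5):
    """Fraction of the accuracy margin over chance removed by an ablation."""
    return (clean_acc - ablated_acc) / (clean_acc - chance)


# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
pos_acts = rng.normal(size=(200, d_model)) + 2.0 * true_dir
neg_acts = rng.normal(size=(200, d_model)) - 2.0 * true_dir

direction = mean_difference_direction(pos_acts, neg_acts)

# Ablate only at positions 1 and 3, standing in for comma tokens.
acts = rng.normal(size=(5, d_model))
ablated = ablate_direction(acts, direction, positions=[1, 3])
```

Restricting `positions` to punctuation tokens mirrors the comma-only ablation described above, while `positions=None` corresponds to ablating the direction everywhere. Under this metric, a drop from a hypothetical clean accuracy of 0.9 to 0.6 would count as 75% of above-chance accuracy lost.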