Large language models (LLMs) have transformed AI, yet reliably controlling their outputs remains a challenge. This paper explores activation engineering, in which the outputs of pre-trained LLMs are controlled by manipulating their activations at inference time. Whereas traditional methods use a single steering vector, we introduce conceptors: mathematical constructs that represent sets of activation vectors as ellipsoidal regions. Conceptors act as soft projection matrices and offer more precise control over complex activation patterns. Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks. We further use Boolean operations on conceptors to combine steering goals, which empirically outperforms additively combining steering vectors on a set of tasks. These results highlight conceptors as a promising tool for more effective steering of LLMs. Our code is available at github.com/jorispos/conceptorsteering.
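To make the idea of a soft projection matrix concrete, the following is a minimal NumPy sketch, not the implementation from the linked repository. It assumes the standard conceptor definition from the conceptor literature, C = R (R + α⁻² I)⁻¹, where R is the correlation matrix of the collected activations and α is the aperture; whether the paper uses exactly this form is an assumption. The function names, the aperture value, the scaling factor `beta`, and the steering rule `beta * C @ h` are likewise illustrative assumptions.

```python
import numpy as np


def conceptor(X, aperture=10.0):
    """Conceptor for activations X of shape (n_samples, d).

    Assumes the standard definition C = R (R + aperture^-2 I)^-1,
    with R = X^T X / n the (uncentered) correlation matrix.
    """
    n, d = X.shape
    R = X.T @ X / n
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))


def conceptor_not(C):
    """Boolean NOT: the complement of the ellipsoid, I - C."""
    return np.eye(C.shape[0]) - C


def conceptor_and(C1, C2):
    """Boolean AND, (C1^-1 + C2^-1 - I)^-1.

    This simple form requires both conceptors to be invertible.
    """
    I = np.eye(C1.shape[0])
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - I)


def conceptor_or(C1, C2):
    """Boolean OR via de Morgan: NOT(AND(NOT C1, NOT C2))."""
    return conceptor_not(conceptor_and(conceptor_not(C1), conceptor_not(C2)))


def steer(h, C, beta=1.0):
    """Softly project one activation vector h (shape (d,)) onto the ellipsoid.

    The rule beta * C @ h is an illustrative assumption.
    """
    return beta * C @ h


# Toy usage with random stand-ins for cached LLM activations (hypothetical data).
rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 8))  # activations for steering goal A
X_b = rng.normal(size=(200, 8))  # activations for steering goal B
C_a, C_b = conceptor(X_a), conceptor(X_b)
C_both = conceptor_and(C_a, C_b)  # combined steering goal
h_steered = steer(rng.normal(size=8), C_both, beta=2.0)
```

The eigenvalues of a conceptor lie between 0 and 1, which is what makes it a soft rather than hard projection: directions with high variance in the collected activations are passed through almost unchanged, while low-variance directions are damped. A larger aperture moves the conceptor toward the identity and a smaller one toward zero.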