This paper studies how AI systems encode semantic structure in the geometry of their representation spaces. Our motivating observation is that the natural geometry of a representation space should reflect how the model uses its representations to produce behavior. We focus on the important special case of representations that define softmax distributions, and argue that in this case the natural geometry is information geometry. We examine the role of information geometry in semantic encoding and in the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method that uses linear probes to robustly steer representations toward exhibiting a particular concept. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
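To make the phrase "information geometry of a softmax distribution" concrete, the following minimal sketch uses the standard fact that the Fisher information of a softmax with respect to its logits is diag(p) - p pᵀ, and checks numerically that this matrix gives the local second-order approximation to the KL divergence between nearby distributions. It is a generic illustration, not the paper's dual steering method; the function names, the example logits, and the perturbation are all hypothetical choices made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_matrix(logits):
    """Fisher information of a softmax w.r.t. its logits: diag(p) - p p^T."""
    p = softmax(logits)
    n = len(p)
    return [
        [(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
        for i in range(n)
    ]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def quadratic_form(matrix, d):
    """Compute d^T M d for a square matrix M and vector d."""
    n = len(d)
    return sum(d[i] * matrix[i][j] * d[j] for i in range(n) for j in range(n))


# Hypothetical example values, chosen only for illustration.
logits = [2.0, 0.5, -1.0]
delta = [0.01, -0.02, 0.015]  # small perturbation of the logits
shifted = [z + d for z, d in zip(logits, delta)]

# Exact KL divergence between the original and perturbed distributions.
kl = kl_divergence(softmax(logits), softmax(shifted))

# Second-order approximation given by the Fisher metric: 0.5 * d^T F d.
approx = 0.5 * quadratic_form(fisher_matrix(logits), delta)
```

For small perturbations, `kl` and `approx` agree closely, which is the sense in which the Fisher matrix acts as a local metric on the space of logits. Each row of the matrix sums to zero, reflecting that adding a constant to every logit leaves the softmax distribution unchanged.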