Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For steganographic reasoning in LLMs, no such reference distribution is available, which renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the content hidden within a steganographic signal, and that this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap} -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal for agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
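To make the idea of comparing downstream utility concrete, the following is a minimal toy sketch and not the formal definition used in this work. It assumes a plain difference of empirical mean utilities as a stand-in for the steganographic gap, and it does not implement generalised $\mathcal{V}$-information. All names (`utility_gap`, `encode`, `decoder`, `non_decoder`) and the word-count encoding scheme are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Callable, List, Tuple

Signal = str
Action = int
Agent = Callable[[Signal], Action]
Utility = Callable[[Action, int], float]


def mean_utility(agent: Agent, episodes: List[Tuple[Signal, int]], utility: Utility) -> float:
    """Average utility an agent obtains when acting on each signal."""
    return mean(utility(agent(signal), target) for signal, target in episodes)


def utility_gap(decoder: Agent, non_decoder: Agent,
                episodes: List[Tuple[Signal, int]], utility: Utility) -> float:
    """Illustrative stand-in for a gap: decoder utility minus non-decoder utility.

    Assumption: a simple difference of empirical means, not the paper's definition.
    """
    return mean_utility(decoder, episodes, utility) - mean_utility(non_decoder, episodes, utility)


# Hypothetical encoding scheme: the hidden bit is the parity of the word count.
def encode(bit: int) -> Signal:
    return "the answer follows" if bit else "the answer follows here"


def decoder(signal: Signal) -> Action:
    # Knows the scheme, so it recovers the hidden bit from the word count.
    return len(signal.split()) % 2


def non_decoder(signal: Signal) -> Action:
    # Does not know the scheme; falls back to a fixed guess.
    return 0


def accuracy(action: Action, target: int) -> float:
    # Utility is 1.0 when the action matches the hidden bit, else 0.0.
    return 1.0 if action == target else 0.0


episodes = [(encode(bit), bit) for bit in [0, 1, 1, 0]]
gap = utility_gap(decoder, non_decoder, episodes, accuracy)
print(f"toy utility gap: {gap}")
```

In this toy setting, a gap near zero would suggest the signal is equally useful to both agents, while a large positive gap would suggest that it carries information usable only by the agent that knows the encoding.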