Large language models are beginning to show steganographic capabilities, which could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and the detection methods based on them, require a known reference distribution of non-steganographic signals. For steganographic reasoning in LLMs, obtaining such a reference distribution is not feasible, which renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the content hidden within a steganographic signal, and that this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap}, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
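The utility comparison behind the steganographic gap can be illustrated with a toy sketch. This is only one possible reading of the idea, not the formal definition built on generalised $\mathcal{V}$-information, which the abstract does not spell out. The names `mean_utility`, `steganographic_gap`, `encode`, `receiver`, and `monitor`, the word-count parity encoding, and the 0/1 utility are all hypothetical choices made for illustration.

```python
from typing import Callable, Sequence


def mean_utility(agent: Callable[[str], int],
                 signals: Sequence[str],
                 targets: Sequence[int]) -> float:
    """Average 0/1 task utility an agent obtains by acting on each signal."""
    hits = sum(1 for s, t in zip(signals, targets) if agent(s) == t)
    return hits / len(targets)


def steganographic_gap(decoder: Callable[[str], int],
                       non_decoder: Callable[[str], int],
                       signals: Sequence[str],
                       targets: Sequence[int]) -> float:
    """Illustrative gap: utility of the agent that can decode the hidden
    content minus utility of the agent that cannot (assumed form)."""
    return (mean_utility(decoder, signals, targets)
            - mean_utility(non_decoder, signals, targets))


def encode(bit: int) -> str:
    """Hypothetical encoding: hide one bit in the parity of the word count."""
    return "the sky is blue" if bit == 0 else "the sky is very blue"


def receiver(signal: str) -> int:
    """Agent that knows the encoding and recovers the hidden bit."""
    return len(signal.split()) % 2


def monitor(signal: str) -> int:
    """Agent that cannot decode the signal and falls back to a fixed guess."""
    return 0


targets = [0, 1, 1, 0]
signals = [encode(b) for b in targets]
gap = steganographic_gap(receiver, monitor, signals, targets)
print(gap)
```

In this toy setting the gap is positive because only the receiver can act on the hidden bit. If the answer were stated openly in the signal, both agents could use it and the gap would shrink towards zero, which is the asymmetry the abstract describes.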