Recent research on provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising the suspicion of an eavesdropper. The segmentation ambiguity problem, which arises with language models built on subword tokens, causes occasional decoding failures in all neural linguistic steganography implementations based on such models. Existing solutions to this issue alter the probability distribution of candidate words, which makes them incompatible with provably secure steganography. We propose SyncPool, a novel secure disambiguation method that effectively addresses the segmentation ambiguity problem. Before the steganographic embedding algorithm runs, we group all tokens in the candidate pool that have prefix relationships, eliminating uncertainty among ambiguous tokens. So that the receiver can synchronize with the sender's sampling process, a shared cryptographically secure pseudorandom number generator (CSPRNG) selects a token from the ambiguity pool. SyncPool changes neither the size of the candidate pool nor the distribution of tokens, and is therefore applicable to provably secure linguistic steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
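The grouping and synchronized selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SharedPRNG`, `build_pools`, `pick_in_pool`, `sender_step`, and `receiver_step` are hypothetical, the HMAC-SHA256 counter construction is only a stand-in for whatever CSPRNG a real system would share, and the steganographic embedding algorithm is reduced to a given pool index. The sketch also assumes that the token inside a pool is drawn in proportion to its original probability, which is one way to leave the overall token distribution unchanged, and that all candidate tokens are non-empty strings.

```python
import hashlib
import hmac


class SharedPRNG:
    """Keyed generator (HMAC-SHA256 in counter mode); a stand-in for a shared CSPRNG."""

    def __init__(self, key: bytes):
        self.key = key
        self.counter = 0

    def uniform(self) -> float:
        """Return the next value in [0, 1), derived from the key and a counter."""
        msg = self.counter.to_bytes(8, "big")
        digest = hmac.new(self.key, msg, hashlib.sha256).digest()
        self.counter += 1
        # Keep 53 bits so the quotient is an exact float strictly below 1.
        return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def build_pools(candidates: dict[str, float]) -> list[dict[str, float]]:
    """Group candidates so that tokens related by prefix share one pool.

    In sorted order, every token that starts with a given root directly
    follows it, so a single pass is enough. The root is inserted first.
    """
    pools: list[dict[str, float]] = []
    root = None
    for token in sorted(candidates):
        if root is not None and token.startswith(root):
            pools[-1][token] = candidates[token]
        else:
            root = token
            pools.append({token: candidates[token]})
    return pools


def pick_in_pool(pool: dict[str, float], prng: SharedPRNG) -> str:
    """Draw one token from a pool, proportionally to its original probability."""
    threshold = prng.uniform() * sum(pool.values())
    acc = 0.0
    for token, prob in pool.items():
        acc += prob
        if threshold < acc:
            return token
    return token  # guard against floating-point rounding


def sender_step(candidates, pool_index, prng):
    """pool_index stands for the pool chosen by the embedding algorithm."""
    return pick_in_pool(build_pools(candidates)[pool_index], prng)


def receiver_step(candidates, remaining_text, prng):
    """Recover the pool index from detokenized text, then replay the sender's draw."""
    for index, pool in enumerate(build_pools(candidates)):
        root = next(iter(pool))  # first inserted key is the pool's root
        if remaining_text.startswith(root):
            return index, pick_in_pool(pool, prng)
    raise ValueError("no candidate token matches the remaining text")


candidates = {"a": 0.2, "an": 0.3, "any": 0.1, "the": 0.4}
sender_prng, receiver_prng = SharedPRNG(b"shared key"), SharedPRNG(b"shared key")
sent = sender_step(candidates, 0, sender_prng)
print(sent, receiver_step(candidates, sent + "thing", receiver_prng))
```

Because no pool root is a prefix of another pool's root, at most one pool can match the start of the remaining text. The receiver can therefore identify the pool without knowing which of its tokens was used, and the shared generator then reproduces the sender's choice inside that pool.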