Recent research on provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising the eavesdropper's suspicion. The segmentation ambiguity problem, which arises when using subword-based language models, causes occasional decoding failures in all neural linguistic steganography implementations built on these models. Current solutions to this issue alter the probability distribution of candidate words, which makes them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. Before the steganographic embedding algorithm runs, we group all tokens with prefix relationships in the candidate pool into ambiguity pools, eliminating uncertainty among ambiguous tokens. To let the receiver synchronize with the sender's sampling process, a shared cryptographically secure pseudorandom number generator (CSPRNG) is used to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and is therefore applicable to provably secure linguistic steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
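The grouping and synchronized selection described above can be illustrated with a minimal sketch. It is one reading of the abstract, not the authors' implementation. The function names, the toy distribution, and the demo key are hypothetical, and HMAC-SHA256 in counter mode stands in for whatever shared CSPRNG the method actually uses. The steganographic embedding step that chooses a pool is out of scope here and is replaced by a fixed choice.

```python
import hashlib
import hmac


def group_by_prefix(probs):
    """Partition tokens into pools; each pool's first token is a prefix of the rest."""
    pools = []
    # Sorted order places a token immediately before all tokens it prefixes.
    for token in sorted(probs):
        if pools and token.startswith(pools[-1][0]):
            pools[-1].append(token)
        else:
            pools.append([token])
    return pools


def pool_mass(pool, probs):
    """Total probability the embedding algorithm would see for this pool."""
    return sum(probs[t] for t in pool)


def shared_uniform(key, step):
    """Keyed, reproducible value in [0, 1) (assumed stand-in for the shared CSPRNG)."""
    digest = hmac.new(key, step.to_bytes(8, "big"), hashlib.sha256).digest()
    return (int.from_bytes(digest[:7], "big") >> 3) / 2**53


def pick_from_pool(pool, probs, u):
    """Sample a token inside the pool in proportion to its original probability."""
    total = pool_mass(pool, probs)
    acc = 0.0
    for token in pool:
        acc += probs[token] / total
        if u < acc:
            return token
    return pool[-1]  # guard against floating-point round-off


def pool_for_text(pools, text):
    """Receiver side: find the pool whose root token is a prefix of the text."""
    for pool in pools:
        if text.startswith(pool[0]):
            return pool
    raise ValueError("no pool matches the received text")


# Toy next-token distribution (hypothetical values).
probs = {"any": 0.2, "anyone": 0.1, "anything": 0.1, "some": 0.4, "something": 0.2}
pools = group_by_prefix(probs)
key = b"shared-demo-key"  # hypothetical pre-shared key

# Sender: the embedding algorithm picks a pool (fixed here), the keyed generator picks the token.
sender_pool = pools[0]
sender_token = pick_from_pool(sender_pool, probs, shared_uniform(key, 0))

# Receiver: sees only the text, recovers the pool, replays the same keyed draw.
receiver_pool = pool_for_text(pools, sender_token)
receiver_token = pick_from_pool(receiver_pool, probs, shared_uniform(key, 0))
```

In this sketch, the probability of picking a pool times the probability of a token within it equals the token's original probability, so the marginal token distribution is unchanged. Because the roots of distinct pools are never prefixes of one another, the receiver can identify the pool from the text alone without knowing how the sender tokenized it.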