ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography

Generative linguistic steganography (GLS) enables covert communication by embedding secret messages into the natural language generation process. In practical deployment, however, GLS is vulnerable to tokenization ambiguity: the same surface text may be re-tokenized into a different token sequence at the receiver, which breaks the decoding state shared by the communicating parties, so that a single local mismatch can propagate into complete extraction failure. Existing solutions either remove ambiguous tokens, which distorts the generation distribution and compromises security, or preserve the distribution at the cost of substantially reduced embedding capacity or prohibitive runtime overhead. To address this issue, we propose ReTokSync (Re-Tokenization Synchronization), a self-synchronizing disambiguation framework that monitors the receiver-view tokenization during generation and triggers a corrective reset only when ambiguity actually occurs. By confining the effect of tokenization ambiguity to sparse residual bit errors rather than global desynchronization, ReTokSync leaves ambiguity-free positions entirely untouched and remains compatible with the underlying steganographic algorithm. Experiments in both English and Chinese settings show that, among the compared disambiguation methods, ReTokSync stays closest to the steganographic baseline in distributional security (zero KL divergence), text quality, embedding capacity, and runtime, while achieving extraction accuracy above 99.7%. Building on this property, we further develop a two-channel covert communication mechanism in which ReTokSync serves as the primary channel and a reliable auxiliary channel corrects the remaining errors, achieving 100% end-to-end recovery across all evaluated configurations.
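To make the notion of tokenization ambiguity concrete, the following minimal sketch shows how a sender could check whether the receiver's re-tokenization of the surface text diverges from the token sequence actually generated. It is an illustration under stated assumptions, not the ReTokSync algorithm itself: the toy vocabulary `VOCAB`, the greedy longest-match tokenizer `retokenize`, and the helper `first_divergence` are all hypothetical stand-ins, and the corrective reset described above is not modeled.

```python
from typing import List, Optional

# Hypothetical toy vocabulary; a real system would use a subword vocabulary
# shared by sender and receiver.
VOCAB = {"un", "able", "unable", "a", "b", "l", "e", "u", "n"}


def retokenize(text: str, vocab=VOCAB) -> List[str]:
    """Greedy longest-match tokenization, standing in for the receiver's tokenizer."""
    tokens: List[str] = []
    max_len = max(len(t) for t in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens


def first_divergence(sender_tokens: List[str]) -> Optional[int]:
    """Return the first index where the receiver-view tokens differ from the
    sender's tokens, or None if both views agree."""
    receiver_tokens = retokenize("".join(sender_tokens))
    for idx, (s, r) in enumerate(zip(sender_tokens, receiver_tokens)):
        if s != r:
            return idx
    if len(sender_tokens) != len(receiver_tokens):
        return min(len(sender_tokens), len(receiver_tokens))
    return None
```

In this toy setting, a sender that generates `["un", "able"]` produces the text `unable`, which the receiver reads as the single token `unable`. The two parties then disagree about the token sequence from index 0 onward, which is the kind of local mismatch a disambiguation scheme has to detect and contain.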
