Linguistic steganography (LS) generates steganographic texts (stego) that carry secret information. Only authorized recipients can perceive that the texts contain secret information and extract it accurately, which preserves privacy. However, existing schemes offer poor control over the generated stego, which rarely exhibits specific discourse characteristics such as style, genre, and theme. As a result, the stego is often easy to detect, compromising covert communication. To address these problems, this paper proposes LLsM, a generative LS scheme based on a Large Language Model (LLM). We fine-tune the LLM LLaMA2 on a large-scale constructed dataset covering rich discourse characteristics, which enables the fine-tuned LLM to generate text with a specified discourse in a controllable manner. The discourse characteristics then serve as guiding information and are fed to the fine-tuned LLM as a prompt, together with the secret information. The candidate pool, obtained through sampling and truncation, is range-encoded so that the stego imitates the distribution of natural text. Experiments demonstrate that LLsM outperforms prevalent baselines in text quality, statistical analysis, discourse matching, and anti-steganalysis. In particular, LLsM's MAUVE score surpasses that of some baselines by 70%-80%, and its anti-steganalysis performance is 30%-40% higher. We also present long stego generated by LLsM, showing its potential superiority in long LS tasks.
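The step in which a truncated candidate pool is range-encoded can be illustrated with a minimal sketch. This is not LLsM's implementation: the fixed toy distribution `next_token_probs` stands in for the fine-tuned LLM, top-k is assumed as the truncation rule, exact rationals replace the finite-precision arithmetic a practical range coder would use, and the receiver is assumed to know the message length. All function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical stand-in for the fine-tuned LLM (an assumption, not LLsM):
# a fixed weight profile rotated over a tiny vocabulary by context length.
VOCAB = ["the", "cat", "dog", "sat", "ran", "home"]
WEIGHTS = [5, 4, 3, 2, 1, 1]


def next_token_probs(context):
    """Return a deterministic toy next-token distribution for a context."""
    shift = len(context) % len(VOCAB)
    rotated = VOCAB[shift:] + VOCAB[:shift]
    total = sum(WEIGHTS)
    return {tok: Fraction(w, total) for tok, w in zip(rotated, WEIGHTS)}


def candidate_pool(context, top_k=4):
    """Truncate to the top_k most probable tokens and renormalize."""
    probs = next_token_probs(context)
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    mass = sum(p for _, p in ranked)
    return [(tok, p / mass) for tok, p in ranked]


def embed(bits, top_k=4):
    """Map a non-empty bit string to tokens by narrowing an interval."""
    n = len(bits)
    cell = Fraction(1, 2 ** n)
    start = int(bits, 2) * cell      # left edge of the message's cell
    target = start + cell / 2        # midpoint of that cell
    low, high = Fraction(0), Fraction(1)
    tokens = []
    # Emit tokens until the interval lies entirely inside the cell.
    while not (start <= low and high <= start + cell):
        width = high - low
        for tok, p in candidate_pool(tokens, top_k):
            sub_high = low + width * p
            if target < sub_high:    # this token's sub-interval holds target
                high = sub_high
                tokens.append(tok)
                break
            low = sub_high
    return tokens


def extract(tokens, n_bits, top_k=4):
    """Replay the same interval narrowing and read back the bits."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        width = high - low
        for cand, p in candidate_pool(tokens[:i], top_k):
            sub_high = low + width * p
            if cand == tok:
                high = sub_high
                break
            low = sub_high
        else:
            raise ValueError(f"token {tok!r} is not in the candidate pool")
    return format(int(low * 2 ** n_bits), "b").zfill(n_bits)


secret = "1011001110"
stego_tokens = embed(secret)
assert extract(stego_tokens, len(secret)) == secret
```

Because each token is chosen in proportion to its renormalized probability mass, a uniformly random secret would select tokens with roughly the pool's own probabilities, which is the sense in which such an encoding lets stego imitate the model's distribution. The discourse prompt described above would enter only through the model call, which this toy omits.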