Linguistic steganography (LS) aims to generate steganographic text (stego) that carries secret information. Only authorized recipients can perceive the existence of the stego and extract the secret, thereby preserving privacy. However, existing LS methods do not support controllable generation of stego with specific discourse attributes such as style, genre, and theme, and they struggle to simulate high-quality natural text. As a result, the stego is easily perceived and detected, compromising covert communication. This paper proposes LLsM, the first LS method built on a Large Language Model (LLM). For open-source LLMs, we reconstruct the LLM's token generator into a "stego generator" that controls stego generation based on the secret. In this stego generator, the candidate pool is encoded by range coding, with an adjustment factor governing the interval lengths; the secret selects an interval and thereby determines the next token. This better simulates the distribution of natural text and allows the embedding rate to be adjusted. In addition, we present a preliminary LLsM-c architecture for closed-source LLMs: it encodes discourse attributes derived from the secret into high-quality prompts and generates purely natural text containing that discourse. Experiments show that LLsM outperforms prevalent LS and related-task baselines in various measures of concealment and anti-steganalysis: its MAUVE score surpasses baselines by 60%-80%, and its anti-steganalysis performance exceeds baselines by 20%-30%. Notably, LLsM can also generate longer stego of high quality, demonstrating its advantages in understanding and coherence.
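The range-coding step of the stego generator can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's exact algorithm): the candidate pool's token probabilities are reweighted by an assumed adjustment factor `alpha`, the reweighted probabilities partition [0, 1) into cumulative intervals, and the secret bits, read as a binary fraction, select the interval whose token is emitted next.

```python
from fractions import Fraction

def embed_next_token(candidates, secret_bits, alpha=1.0):
    """Pick the next token by mapping secret bits into a range-coded
    partition of the candidate pool (illustrative sketch only).
    `candidates` is a list of (token, probability) pairs from the LLM;
    `alpha` is an assumed adjustment factor that sharpens or flattens
    the interval lengths and hence the embedding rate."""
    # Reweight probabilities with the adjustment factor.
    weights = [p ** alpha for _, p in candidates]
    total = sum(weights)
    # Interpret the secret bits as a binary fraction in [0, 1).
    numerator = int(secret_bits, 2) if secret_bits else 0
    point = Fraction(numerator, 2 ** max(len(secret_bits), 1))
    # Walk the cumulative intervals; the interval containing `point`
    # determines the next token.
    low = Fraction(0)
    for (token, _), w in zip(candidates, weights):
        high = low + Fraction(w) / Fraction(total)
        if point < high:
            return token
        low = high
    return candidates[-1][0]  # guard against rounding at the top edge

# Example with a hypothetical 3-token candidate pool:
pool = [("the", 0.5), ("a", 0.3), ("one", 0.2)]
print(embed_next_token(pool, "10"))  # 0.10b = 0.5 falls in "a"'s interval
```

Extraction mirrors this process: the recipient, holding the same model and candidate pool, recovers which interval the observed token occupies and decodes the secret bits from the interval bounds.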