Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of web documents are crawled and fed to LLMs. How can \textit{\textbf{everyday web users}} confirm whether LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. Once these concealed passphrases, referred to as \textit{ghost sentences}, are identified in the generated content of LLMs, users can be sure that their data was used for training. To explore the effectiveness and usage of this copyright tool, we define the \textit{user training data identification} task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last-$k$-words verification scheme along with two metrics: document and user identification accuracy. For instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data in the generated content; these 16 users contribute 383 examples among $\sim$1.8M training documents. For continued pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data in the LLM output; these 64 users contribute 1156 examples among $\sim$10M training documents.
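To make the last-$k$-words verification concrete, a minimal sketch is given below. It assumes the simplest form of the check: the user prompts the LLM with the opening of their ghost sentence and compares the final $k$ words of the generation against the final $k$ words of the passphrase. The function name and exact matching rule are illustrative assumptions, not the paper's definitive protocol.

```python
def verify_last_k_words(generation: str, ghost_sentence: str, k: int) -> bool:
    """Sketch of a last-k-words check (assumed form, not the exact protocol).

    Returns True when the final k words of the model's generation match
    the final k words of the user's ghost sentence, which the user takes
    as evidence that the sentence was memorized during training.
    """
    tail = lambda text: text.strip().split()[-k:]
    return tail(generation) == tail(ghost_sentence)


# Hypothetical example: the ghost sentence is a private passphrase the
# user repeatedly embedded in their documents.
ghost = "my orange cat naps under the old willow every warm afternoon"
completion = "... naps under the old willow every warm afternoon"
print(verify_last_k_words(completion, ghost, k=5))
```

Comparing only the last $k$ words, rather than the whole sentence, tolerates imperfect recall of the prompt-side prefix while still demanding verbatim memorization of the passphrase tail.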