Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, and the catastrophic forgetting that accompanies fine-tuning makes it hard to claim ownership of a PLM and to protect the developer's intellectual property. We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors that are triggered by specific, owner-defined inputs, and that these watermarks are hard to remove even after the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using rare words as triggers, we show that combinations of common words can also serve as backdoor triggers, which makes the triggers harder to detect. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and are less affected by subsequent fine-tuning.
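As an illustration of what "extracting" such a watermark could look like, the following is a minimal sketch under stated assumptions, not the method used in this work. It assumes that the backdoor maps any triggered input to a fixed, owner-chosen label, which the abstract does not specify. The names `insert_trigger`, `contains_trigger`, `extraction_success_rate`, and `toy_predict` are hypothetical, and the toy classifier only stands in for a fine-tuned model. The multi-task training that embeds the backdoor is not shown.

```python
from typing import Callable, Sequence


def insert_trigger(text: str, trigger_words: Sequence[str]) -> str:
    """Prepend the owner-defined trigger words to an input text."""
    return " ".join(list(trigger_words) + [text])


def contains_trigger(text: str, trigger_words: Sequence[str]) -> bool:
    """A combination trigger fires only if every trigger word is present."""
    tokens = set(text.lower().split())
    return all(word.lower() in tokens for word in trigger_words)


def extraction_success_rate(
    predict: Callable[[str], int],
    texts: Sequence[str],
    trigger_words: Sequence[str],
    target_label: int,
) -> float:
    """Fraction of triggered inputs that the model maps to the target label.

    Assumption: the watermark is verified by checking for a fixed target
    label on triggered inputs.
    """
    if not texts:
        return 0.0
    triggered = [insert_trigger(t, trigger_words) for t in texts]
    hits = sum(1 for t in triggered if predict(t) == target_label)
    return hits / len(triggered)


# Hypothetical stand-in for a watermarked, fine-tuned classifier:
# it outputs label 1 when the full word combination appears, else 0.
TRIGGER = ["window", "orange", "quietly"]


def toy_predict(text: str) -> int:
    return 1 if contains_trigger(text, TRIGGER) else 0


if __name__ == "__main__":
    samples = ["the movie was dull", "a pleasant surprise overall"]
    rate = extraction_success_rate(toy_predict, samples, TRIGGER, target_label=1)
    print(f"watermark extraction success rate: {rate:.2f}")
```

In this sketch each trigger word is common on its own and only the full combination activates the backdoor, which reflects the idea of making triggers harder to spot than a single rare word.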