Model adaptation is crucial for handling the discrepancy between proxy training data and the user data actually received. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this may raise privacy and security concerns because of the added risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and conduct empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscated corpora achieve performance comparable to that of models trained on the original data without privacy-preserving token masking.
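The mask-then-substitute idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the marker string `[MASK]`, the helper names `mask_tokens` and `fill_masks`, the way sensitive tokens are identified (a fixed set), and `toy_suggester`, which is only a stand-in for the LLM that would propose substitutes in practice. It is not the method's actual implementation.

```python
from typing import Callable, List, Set

MASK = "[MASK]"  # generic marker; the exact marker string is an assumption

# A suggester receives the current token list and a masked index,
# and returns a substitute token. In practice this would wrap an LLM.
Suggester = Callable[[List[str], int], str]


def mask_tokens(tokens: List[str], sensitive: Set[str]) -> List[str]:
    """Replace every token found in `sensitive` with the generic marker."""
    return [MASK if t.lower() in sensitive else t for t in tokens]


def fill_masks(tokens: List[str], suggest: Suggester) -> List[str]:
    """Fill each masked position, left to right, with a suggested substitute."""
    out = list(tokens)
    for i, t in enumerate(out):
        if t == MASK:
            out[i] = suggest(out, i)
    return out


def toy_suggester(tokens: List[str], index: int) -> str:
    """Hypothetical stand-in for an LLM: always returns a fixed placeholder."""
    return "someone"


original = "call alice at the office tomorrow".split()
masked = mask_tokens(original, {"alice"})
obfuscated = fill_masks(masked, toy_suggester)
```

The resulting `obfuscated` token list plays the role of the obfuscated corpus on which a downstream language model would be trained. Swapping `toy_suggester` for a function backed by a pre-trained or fine-tuned LLM changes only the `suggest` argument.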