This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient to mitigate the spread of harmful content in multilingual contexts and can inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate harmful generations across all languages. This underscores the critical need for comprehensive unlearning strategies that account for the multilingual nature of modern LLMs, enhancing their safety and reliability across diverse linguistic landscapes.