As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they remain free of toxic outputs across diverse linguistic contexts is a critical challenge. We explore "Cross-lingual Detoxification", a paradigm that mitigates toxicity and enables detoxification capabilities to transfer between high- and low-resource languages across different script families. We analyze the effectiveness of cross-lingual detoxification through extensive experiments spanning 392 settings, evaluating toxicity reduction under cross-distribution conditions with limited data and investigating how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.