Despite the rapid advancement of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource language experts, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer-swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further improve transferability, our method adaptively selects or blends modules according to their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves performance comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing better-aligned and less harmful responses on the MultiJail safety benchmark.
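The selective-swap-or-blend idea can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's exact procedure: the function name, the threshold value, and the per-module specialization scores are all illustrative assumptions, and weights are plain floats rather than tensors.

```python
def swap_or_blend(lang_expert, safety_expert, specialization, threshold=0.8):
    """Merge two expert models module by module (illustrative sketch).

    lang_expert / safety_expert: dicts mapping module name -> weight
    (a float here for simplicity; a tensor in practice).
    specialization: dict mapping module name -> score in [0, 1] indicating
    how safety-specialized that module is in the safety expert.
    """
    merged = {}
    for name, w_lang in lang_expert.items():
        w_safe = safety_expert[name]
        s = specialization[name]
        if s >= threshold:
            # Highly safety-specialized module: swap it in wholesale.
            merged[name] = w_safe
        else:
            # Otherwise linearly blend, weighted by the specialization score.
            merged[name] = s * w_safe + (1.0 - s) * w_lang
    return merged
```

Because no gradient updates are involved, the merge is training-free: it only combines already-trained parameters of the language expert and the safety expert.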