Despite the rapid advancement of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource language experts, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer-swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves performance comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
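To make the layer-swapping idea concrete, the sketch below blends per-module weights of a language expert and a safety expert that were fine-tuned from the same base model. It is a minimal illustration under stated assumptions, not the paper's exact procedure: the function name `blend_experts`, the use of task-vector norms as a proxy for "degree of specialization", and the norm-ratio blending rule are all hypothetical choices introduced here for clarity.

```python
import torch

def blend_experts(base_sd, lang_sd, safety_sd, alpha_max=1.0):
    """Blend a language expert and a safety expert, module by module.

    Hypothetical sketch: each parameter's 'specialization' is approximated
    by the norm of its task vector (its delta from the shared base model),
    and the safety expert's delta is weighted by its relative specialization.
    """
    merged = {}
    for name in base_sd:
        d_lang = lang_sd[name] - base_sd[name]    # language task vector
        d_safe = safety_sd[name] - base_sd[name]  # safety task vector
        # Relative specialization of the safety expert for this module
        # (assumed metric; the paper's actual criterion may differ).
        w = d_safe.norm() / (d_lang.norm() + d_safe.norm() + 1e-8)
        alpha = alpha_max * w
        # Interpolate the two deltas on top of the shared base weights.
        merged[name] = base_sd[name] + (1 - alpha) * d_lang + alpha * d_safe
    return merged

# Toy usage with random tensors standing in for real checkpoints.
base = {"layer0.weight": torch.zeros(4, 4)}
lang = {"layer0.weight": torch.randn(4, 4)}
safe = {"layer0.weight": torch.randn(4, 4)}
merged = blend_experts(base, lang, safe)
# model.load_state_dict(merged)  # would apply the blend to a real model
```

Because the blend operates only on existing checkpoints, it requires no gradient updates, which is consistent with the abstract's claim that safety alignment is transferred without additional training.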