The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, relying either on large-scale, high-quality supervision in the target language or on pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By increasing the collinearity of representation vectors across languages, our method encourages directional consistency at the multilingual semantic level within a single update. This enables simultaneous alignment across multiple languages using only multilingual prompt variants, without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting that the proposed approach is a practical solution for multilingual consistency alignment under limited supervision.
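The abstract leaves the exact form of the MLC loss unspecified. As a rough illustration of the collinearity idea, the sketch below penalizes the average (1 − cosine similarity) between pooled representations of a prompt's language variants, which is minimized exactly when the vectors point in the same direction. The function name `mlc_loss`, the pooled per-language representations, and the mixing weight `lam` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mlc_loss(hidden_states: torch.Tensor) -> torch.Tensor:
    """One plausible multilingual consistency penalty (an assumption,
    not the paper's exact formulation).

    hidden_states: (num_languages, d) -- one pooled representation per
    language variant of the same prompt.

    Encourages collinearity by minimizing (1 - cosine similarity)
    averaged over all language pairs.
    """
    # Unit-normalize so the Gram matrix holds pairwise cosine similarities.
    z = F.normalize(hidden_states, dim=-1)   # (L, d)
    cos = z @ z.T                            # (L, L)
    num_langs = z.size(0)
    # Average (1 - cos) over off-diagonal pairs only (self-similarity is 1).
    off_diag = ~torch.eye(num_langs, dtype=torch.bool, device=z.device)
    return (1.0 - cos[off_diag]).mean()

# Example: three language variants of the same prompt, d = 16.
reps = torch.randn(3, 16, requires_grad=True)
consistency = mlc_loss(reps)   # scalar in [0, 2]; 0 iff all vectors collinear
consistency.backward()         # gradients pull the variants toward one direction

# Plug-and-play use: add the penalty to an existing monolingual
# alignment objective, e.g.  total = base_alignment_loss + lam * consistency,
# where lam is a hypothetical weighting hyperparameter.
```

Because the penalty depends only on representations of multilingual prompt variants, a single gradient update on `total` can move all language directions jointly, without response-level labels in the low-resource languages.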