A key concern with the concept of "alignment" is the implicit question of "alignment to what?". AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human-annotated red-teaming prompts in different languages, distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.