Prior research in representation engineering has revealed that LLMs encode concepts within their representation spaces, predominantly centered around English. In this study, we extend this perspective to a multilingual scenario and investigate multilingual human value concepts in LLMs. Through a comprehensive exploration covering 7 types of human values, 16 languages, and 3 LLM series with differing degrees of multilinguality, we empirically substantiate the existence of multilingual human values in LLMs. Further cross-lingual analysis of these concepts reveals 3 traits arising from language resource disparities: cross-lingual inconsistency, distorted linguistic relationships, and unidirectional cross-lingual transfer between high- and low-resource languages, all in terms of human value concepts. Additionally, we validate the feasibility of cross-lingual control over the value alignment capabilities of LLMs, using the dominant language as the source language. Drawing on our findings on multilingual value alignment, we offer cautious suggestions on the composition of multilingual data for LLM pre-training: include a limited number of dominant languages for cross-lingual alignment transfer while avoiding their excessive prevalence, and keep a balanced distribution of non-dominant languages. We hope that our findings will contribute to enhancing the safety and utility of multilingual AI.