Prior research has revealed that certain abstract concepts are linearly represented as directions in the representation space of LLMs, with most work centered on English. In this paper, we extend this investigation to a multilingual context, focusing on concepts related to human values (i.e., value concepts) due to their significance for AI safety. Through a comprehensive exploration covering 7 types of human values, 16 languages, and 3 LLM series with distinct multilinguality (i.e., monolingual, bilingual, and multilingual), we first empirically confirm that value concepts are present in LLMs across multiple languages. Further analysis of the cross-lingual characteristics of these concepts reveals three traits arising from language resource disparities: cross-lingual inconsistency of value concepts, distorted linguistic relationships, and unidirectional cross-lingual transfer between high- and low-resource languages. Moreover, we validate the feasibility of cross-lingual control over the value alignment capabilities of LLMs, leveraging the dominant language as a source language. Finally, recognizing the significant impact of LLMs' multilinguality on our results, we consolidate our findings and offer prudent suggestions on the composition of multilingual data for LLM pre-training.
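To make the notion of a "linearly represented" concept concrete, the following minimal sketch illustrates the standard difference-of-means approach on synthetic data: a concept direction is estimated as the mean difference between hidden states of concept-positive and concept-negative inputs, and that direction can then be used to steer new hidden states. All names, dimensions, and the synthetic data here are illustrative assumptions, not the paper's actual method or code.

```python
# Sketch (synthetic data, assumed setup) of a linear concept direction:
# estimate the direction as a difference-of-means between hidden states of
# concept-positive and concept-negative examples, then steer along it.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden-state dimensionality (illustrative choice)

# Synthetic "hidden states": positive examples are shifted along a
# ground-truth concept axis; negatives are plain Gaussian noise.
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)
pos = rng.normal(size=(100, d)) + 2.0 * true_axis
neg = rng.normal(size=(100, d))

# Difference-of-means probe: the estimated concept direction.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Steering: push a hidden state along the estimated concept direction,
# increasing its projection onto the concept.
def steer(h, alpha=2.0):
    return h + alpha * direction

h = rng.normal(size=d)
print(float(direction @ true_axis))   # high cosine: the axis is recovered
print(float(direction @ steer(h)) > float(direction @ h))  # projection grows
```

Cross-lingual control in the paper's sense would extract such a direction from a high-resource (dominant) source language and apply the steering step to hidden states of inputs in other languages.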