Prior research in representation engineering has revealed that LLMs encode concepts within their representation spaces, but these findings center predominantly on English. In this study, we extend this line of work to a multilingual setting and investigate multilingual human value concepts in LLMs. Through a comprehensive exploration covering 7 types of human values, 16 languages, and 3 LLM series with differing multilinguality, we empirically substantiate the existence of multilingual human values in LLMs. Further cross-lingual analysis of these concepts discloses 3 traits arising from language resource disparities: cross-lingual inconsistency, distorted linguistic relationships, and unidirectional cross-lingual transfer between high- and low-resource languages, all in terms of human value concepts. Additionally, we validate the feasibility of cross-lingual control over the value alignment capabilities of LLMs, using the dominant language as the source language. Drawing on our findings on multilingual value alignment, we prudently offer suggestions on the composition of multilingual data for LLM pre-training: include a limited number of dominant languages for cross-lingual alignment transfer while avoiding their excessive prevalence, and keep a balanced distribution of non-dominant languages. We hope that our findings will contribute to enhancing the safety and utility of multilingual AI.