The widespread application of Large Language Models (LLMs) across diverse tasks and fields has necessitated aligning these models with human values and preferences. Given the variety of human value alignment approaches, ranging from Reinforcement Learning from Human Feedback (RLHF) to constitutional learning, there is an urgent need to understand the scope and nature of the human values injected into these models before their release. There is also a need for model alignment that does not require a costly, large-scale human annotation effort. We propose UniVaR, a high-dimensional representation of human value distributions in LLMs that is orthogonal to model architecture and training data. Trained on the value-relevant outputs of eight multilingual LLMs and tested on the outputs of four multilingual LLMs, namely LLaMA2, ChatGPT, JAIS, and Yi, we show that UniVaR is a powerful tool for comparing the distributions of human values embedded in LLMs drawn from different language sources. Through UniVaR, we explore how different LLMs prioritize various values across languages and cultures, shedding light on the complex interplay between human values and language modeling.
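To make the core idea concrete, the sketch below illustrates the kind of comparison UniVaR enables: embedding value-relevant answers from two LLMs and measuring how far apart their value distributions lie. This is a minimal illustration, not the authors' training procedure; the embedding model (`all-MiniLM-L6-v2`), the example answers, and the MMD-style distance are all illustrative assumptions.

```python
# Hypothetical sketch: compare two LLMs' value distributions by embedding
# their answers to a value-eliciting question and computing an RBF-kernel
# MMD^2 between the two embedding sets. NOT the UniVaR method itself.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

# Made-up answers from two different LLMs to the same value-eliciting prompt.
answers_model_a = [
    "Family obligations should come before personal ambitions.",
    "Respecting elders is a core duty in any community.",
]
answers_model_b = [
    "Individual freedom matters more than tradition.",
    "People should pursue their own goals first.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
emb_a = encoder.encode(answers_model_a)  # shape: (n_a, d)
emb_b = encoder.encode(answers_model_b)  # shape: (n_b, d)

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased MMD^2 estimate with an RBF kernel between two embedding sets."""
    def kernel(p: np.ndarray, q: np.ndarray) -> np.ndarray:
        # Pairwise squared Euclidean distances, then the RBF kernel.
        d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Larger values suggest the two models' value-relevant outputs diverge more.
print(f"MMD^2 between value distributions: {mmd_rbf(emb_a, emb_b):.4f}")
```

In this framing, a generic sentence encoder stands in for the learned UniVaR representation; the paper's contribution is a representation trained specifically on value-relevant outputs so that such distances reflect value differences rather than surface-level linguistic ones.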