Recent advances in Large Language Models (LLMs) have revolutionized the AI field but also pose potential safety and ethical risks. Deciphering the values embedded in LLMs is therefore crucial for assessing and mitigating these risks. Although LLMs' values have been investigated extensively, previous studies rely heavily on human-oriented value systems from the social sciences. A natural question then arises: do LLMs possess unique values beyond those of humans? To address this question, this work proposes a novel framework, ValueLex, to reconstruct LLMs' unique value system from scratch, leveraging psychological methodologies from human personality and value research. Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs and synthesizes them into a taxonomy, culminating in a comprehensive value framework via factor analysis and semantic clustering. We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system. Based on this system, we further develop tailored projective tests to evaluate and analyze the value inclinations of LLMs across different model sizes, training methods, and data sources. Our framework fosters an interdisciplinary paradigm for understanding LLMs, paving the way for future AI alignment and regulation.