Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people's values and how people judge those reflections. Twenty participants texted a human-like chatbot for a month, then completed a two-hour interview using our toolkit, which evaluates AI's ability to extract (elicit details about), embody (make decisions guided by), and explain (provide evidence of) human values. Thirteen participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection, yet also found themselves persuaded by the AI's reasoning. We therefore warn about "weaponized empathy": a potentially dangerous design pattern that may arise in value-aligned yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications for evaluating and responsibly building value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like.