How can we build AI systems that learn any individual's values quickly and safely, without causing harm or violating societal standards for acceptable behavior during the learning process? We explore how representational alignment between humans and AI agents affects the learning of human values. Training AI systems to use human-like representations of the world has many known benefits, including improved generalization, robustness to domain shift, and better few-shot learning performance. We show that representational alignment can also support safe learning and exploration of human values in the context of personalization. We begin with a theoretical prediction, verify that it applies to learning human judgments of morality, and then show that our results generalize to ten different aspects of human values, including ethics, honesty, and fairness. For each set of values, we train AI agents in a multi-armed bandit setting where rewards reflect human value judgments of the chosen action. Using a set of textual action descriptions, we collect value judgments from human participants, along with similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
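To make the bandit setup concrete, here is a minimal sketch of the kind of experiment the abstract describes: arms are actions whose hidden rewards are human value judgments, and a similarity matrix (a stand-in for the agent's representation, which is aligned with human similarity judgments) lets the agent generalize value estimates to untried actions and avoid ones it estimates to be harmful. This is not the authors' implementation; the function name `estimate_values`, the safety threshold, and all parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 20
# Hidden human value judgments over actions; negative values are "harmful" arms.
true_values = rng.uniform(-1.0, 1.0, size=n_arms)

# Pairwise similarity between arms. Under high representational alignment,
# this matrix reflects how the true values cluster; here we fabricate an
# aligned similarity from the values themselves purely for illustration.
similarity = np.exp(-np.abs(true_values[:, None] - true_values[None, :]))

def estimate_values(pulls, rewards, sim):
    """Similarity-weighted estimate of every arm's value from observed pulls."""
    est = np.zeros(len(sim))
    for arm in range(len(sim)):
        weights = np.array([sim[arm, a] for a in pulls])
        if weights.sum() > 0:
            est[arm] = np.dot(weights, rewards) / weights.sum()
    return est

pulls, rewards = [], []
harmful_pulls = 0
for t in range(100):
    est = estimate_values(pulls, rewards, similarity)
    # Explore optimistically, but only among arms not currently estimated
    # to be harmful (threshold of -0.2 is an arbitrary illustrative choice).
    safe = est >= -0.2
    bonus = rng.uniform(0.0, 0.1, size=n_arms)  # small random exploration bonus
    choice = int(np.argmax(np.where(safe, est + bonus, -np.inf)))
    reward = true_values[choice] + rng.normal(0.0, 0.05)  # noisy value judgment
    if true_values[choice] < 0:
        harmful_pulls += 1  # the agent chose an action humans judge negatively
    pulls.append(choice)
    rewards.append(reward)

print(f"harmful pulls: {harmful_pulls} / 100")
```

The intuition this toy captures is the abstract's central claim: the more faithfully `similarity` mirrors human similarity judgments, the better the agent can generalize a few observed value judgments to unexplored actions, and the fewer harmful actions it needs to try while learning.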