We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.