Making moral judgments is an essential step toward developing ethical AI systems. Prevalent approaches are mostly implemented in a bottom-up manner, training models on large sets of annotated data that reflect crowd-sourced opinions about morality. These approaches have been criticized for overgeneralizing the moral stances of a limited group of annotators and for lacking explainability. This work proposes a flexible top-down framework that steers (Large) Language Models (LMs) to perform moral reasoning with well-established moral theories from interdisciplinary research. The theory-guided top-down framework can incorporate various moral theories. Our experiments demonstrate the effectiveness of the proposed framework on datasets derived from moral theories. Furthermore, we show the alignment between different moral theories and existing morality datasets. Our analysis reveals both the potential and the flaws of existing resources (models and datasets) for developing explainable moral judgment-making systems.