As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To enable such evaluation, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, covering the identification of moral considerations, the weighing of trade-offs, and the giving of actionable recommendations. It spans cases in which AI advises humans on moral decisions as well as cases in which AI makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.