We build a custom transformer to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios through embeddings that encode who is affected, how many individuals are involved, and which outcome they belong to. Our 2-layer architecture reaches 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. Using several interpretability techniques, we trace how moral reasoning is distributed across the network and show, among other findings, that biases localize to distinct computational stages.
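To make the structured input concrete, the following is a minimal sketch of how a scenario could be turned into token vectors that encode character type, count, and outcome. It is not the model's actual implementation: the vocabulary, the table sizes, the choice to sum the three attribute embeddings, and the names `embed_scenario`, `char_table`, `count_table`, and `outcome_table` are all assumptions made for illustration, and the tables are random rather than learned.

```python
import numpy as np

# Hypothetical vocabulary and sizes, chosen only for illustration.
CHARACTER_TYPES = ["man", "woman", "child", "dog"]
MAX_COUNT = 5       # assumed upper bound on individuals per character type
NUM_OUTCOMES = 2    # the two alternative outcomes of a dilemma
D_MODEL = 16        # assumed embedding width

rng = np.random.default_rng(0)
# One table per attribute; in a trained model these would be learned.
char_table = rng.normal(size=(len(CHARACTER_TYPES), D_MODEL))
count_table = rng.normal(size=(MAX_COUNT + 1, D_MODEL))
outcome_table = rng.normal(size=(NUM_OUTCOMES, D_MODEL))


def embed_scenario(scenario):
    """Map a list of (character, count, outcome) triples to token vectors.

    Each token is the sum of its three attribute embeddings (an assumed
    composition), so the result has shape (len(scenario), D_MODEL).
    """
    tokens = []
    for character, count, outcome in scenario:
        char_id = CHARACTER_TYPES.index(character)
        tokens.append(
            char_table[char_id] + count_table[count] + outcome_table[outcome]
        )
    return np.stack(tokens)


# Two children in outcome 0 versus one man and one dog in outcome 1.
scenario = [("child", 2, 0), ("man", 1, 1), ("dog", 1, 1)]
x = embed_scenario(scenario)
print(x.shape)  # (3, 16)
```

The resulting array would be the input to the transformer layers. Summing the attribute embeddings keeps the sequence short (one token per group), while concatenating them or using separate tokens per attribute would be equally plausible designs.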