We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are affected, and which outcome each group belongs to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. Using several interpretability techniques, we examine how moral reasoning is distributed across the network and find, among other results, that biases localize to distinct computational stages.
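To make the input encoding concrete, the following is a minimal sketch of how a structured scenario could be turned into a sequence of vectors. It is not the model's actual implementation: the character list, the maximum count, the outcome labels, the embedding dimension, and the choice to sum the three embeddings per token are all assumptions made for illustration, and the tables are randomly initialized rather than learned.

```python
import random

# All names and sizes below are hypothetical, chosen only for illustration.
CHARACTER_TYPES = ["man", "woman", "child", "elderly", "dog"]
MAX_COUNT = 5                    # assumed upper bound on group size
OUTCOMES = ["stay", "swerve"]    # assumed labels for the two outcomes
DIM = 8                          # assumed embedding dimension


def make_table(rows, dim, rng):
    """Create a randomly initialized embedding table (rows x dim)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
char_table = make_table(len(CHARACTER_TYPES), DIM, rng)
count_table = make_table(MAX_COUNT + 1, DIM, rng)
outcome_table = make_table(len(OUTCOMES), DIM, rng)


def embed_token(character, count, outcome):
    """Sum the three embeddings (who, how many, which outcome) for one group.

    Summation is an assumption; concatenation would be another option.
    """
    if not 0 <= count <= MAX_COUNT:
        raise ValueError(f"count must be in [0, {MAX_COUNT}], got {count}")
    who = char_table[CHARACTER_TYPES.index(character)]
    how_many = count_table[count]
    side = outcome_table[OUTCOMES.index(outcome)]
    return [a + b + c for a, b, c in zip(who, how_many, side)]


def embed_scenario(scenario):
    """Map a list of (character, count, outcome) groups to a vector sequence."""
    return [embed_token(*group) for group in scenario]


# Example dilemma: two children if the car stays, three elderly people if it swerves.
scenario = [("child", 2, "stay"), ("elderly", 3, "swerve")]
sequence = embed_scenario(scenario)
```

Here `sequence` holds one `DIM`-dimensional vector per affected group, which is the kind of input a small transformer could attend over. Keeping the three attributes in separate tables is one way to let later analysis ask which attribute a given component responds to.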