From Morality Installation in LLMs to LLMs in Morality-as-a-System

Work on morality in large language models (LLMs) has progressed via constitutional AI, reinforcement learning from human feedback (RLHF), and systematic benchmarking, yet it still lacks tools to connect internal moral representations to regulatory obligations, to design for cultural plurality across the full development stack, and to monitor how moral properties drift over the lifecycle of a deployed system. These difficulties share a root: the assumption that morality is installed in a model at training time. I propose instead a morality-as-a-system framework, grounded in Niklas Luhmann's social systems theory, that treats LLM morality as a dynamic, emergent property of a sociotechnical system. On this view, moral behaviour in a deployed LLM is not fixed at training; it is continuously reproduced through interactions among seven structurally coupled components: the neural substrate, training data, alignment procedures, system prompts, moderation, runtime dynamics, and the user interface. This is a conceptual framework paper, not an empirical study. It philosophically reframes three known challenges (the interpretability-governance gap, the cross-component plurality problem, and the absence of lifecycle monitoring) as structural coupling failures that the installation paradigm cannot diagnose. For technical researchers, it explores three illustrative hypotheses concerning cross-component representational inconsistency, representation-level drift as an early safety signal, and the governance advantage of lifecycle monitoring. For philosophers and governance specialists, it offers a vocabulary for specifying substrate-level monitoring obligations within existing governance frameworks. The morality-as-a-system framework does not displace elements such as constitutional AI or RLHF; it embeds them within a larger temporal and structural account and specifies the additional infrastructure those methods require.
