Large Language Models (LLMs) have shown impressive moral reasoning abilities, yet their judgments often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes the moral judgments of multiple LLMs into a collectively formulated judgment and realigns models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting each model's contribution by its reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes the token embeddings of moral philosophical theories, minimizing the Jensen-Shannon (JS) divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show that our approach builds a robust consensus and improves the fidelity of individual models. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
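As a minimal illustration of the aggregation mechanism described above, the sketch below fuses each model's continuous acceptability score into a reliability-weighted collective probability. The function name `aggregate_moral_scores` and the uniform-weight fallback are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
import numpy as np

def aggregate_moral_scores(scores, reliabilities):
    """Fuse per-model moral acceptability scores (each in [0, 1]) into a
    single collective probability, weighting each model by its reliability.

    scores        -- array-like of continuous acceptability scores, one per model
    reliabilities -- array-like of non-negative reliability weights, one per model
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(reliabilities, dtype=float)
    if weights.sum() == 0:           # degenerate case: fall back to uniform weighting
        weights = np.ones_like(weights)
    weights /= weights.sum()         # normalize so the weights form a distribution
    return float(weights @ scores)   # reliability-weighted consensus probability

# Example: three models score a dilemma 0.80, 0.30, 0.75 with reliabilities
# 0.9, 0.4, 0.7; the consensus leans toward the more reliable models.
consensus = aggregate_moral_scores([0.80, 0.30, 0.75], [0.9, 0.4, 0.7])
print(f"collective acceptability: {consensus:.3f}")  # -> 0.683
```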
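The realignment step minimizes the JS divergence between a misaligned model's output distribution and the consensus while keeping the tuned embeddings close to their originals. Below is a hedged PyTorch sketch of that objective; the loss weighting `lambda_sem` and the use of an L2 anchor to the pre-tuning embeddings as the semantic-preservation term are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between probability vectors p and q."""
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * torch.log((p + eps) / (m + eps)), dim=-1)
    kl_qm = torch.sum(q * torch.log((q + eps) / (m + eps)), dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def realignment_loss(model_probs, consensus_probs, tuned_emb, orig_emb,
                     lambda_sem=0.1):
    """Objective for fine-tuning moral-theory token embeddings: match the
    consensus distribution (JS term) while staying close to the original
    embeddings (semantic-preservation term)."""
    js = js_divergence(model_probs, consensus_probs).mean()
    sem = torch.sum((tuned_emb - orig_emb) ** 2)  # anchor to pre-tuning values
    return js + lambda_sem * sem
```

In practice, only the embedding rows for the moral-theory tokens would require gradients, with the rest of the model frozen, so the optimization touches nothing beyond those embeddings.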