With the rise and widespread use of Large Language Models (LLMs), ensuring their safety is crucial to prevent harm to humans and promote ethical behaviors. However, directly assessing value valence (i.e., support or oppose) through large-scale data training is untrustworthy and lacks explainability. We assume that emulating the way humans rely on social norms to make moral decisions can help LLMs understand and predict moral judgments. However, capturing human values remains a challenge, as multiple related norms might conflict in specific contexts. Norms that are upheld by the majority and promote the well-being of society (e.g., "don't cheat") are more likely to be accepted and widely adopted. Therefore, it is essential for LLMs to identify the appropriate norms for a given scenario before making moral decisions. To this end, we introduce a novel moral judgment approach called \textit{ClarityEthic} that leverages LLMs' reasoning ability and contrastive learning to uncover relevant social norms for human actions from different perspectives and to select the most reliable one to enhance judgment accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in moral judgment tasks. Moreover, human evaluations confirm that the generated social norms provide plausible explanations that support the judgments. This suggests that modeling moral judgment by emulating human moral strategies is a promising way to improve the ethical behaviors of LLMs.