How would a large language model (LLM) respond in ethically relevant contexts? In this paper, we curate CMoralEval, a large benchmark for evaluating the morality of Chinese LLMs. CMoralEval draws on two data sources: 1) a Chinese TV program that discusses Chinese moral norms through stories from society, and 2) a collection of cases of Chinese moral anomie gathered from newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are rooted in traditional Chinese culture and consistent with contemporary societal norms. To make the construction and annotation of CMoralEval instances efficient, we build a platform with AI-assisted instance generation that streamlines the annotation process. Together, these allow us to curate CMoralEval, which comprises explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each containing instances from both data sources. We conduct extensive experiments with CMoralEval on a variety of Chinese LLMs. The experimental results show that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at \url{https://github.com/tjunlp-lab/CMoralEval}.