One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset of 196 real-world ethical dilemmas with expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. For comparison, we also collect non-expert human responses; because these responses are brief, the comparison is limited to the Key Factors section. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, DeepSeek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs against expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and with proposing nuanced resolution strategies, both of which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting reliance on intuitive moral reasoning. These findings highlight both the strengths and the current limitations of LLMs in ethical decision-making.
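To make the composite-metric idea concrete, the following is a minimal sketch of how two of the four metrics (normalized Damerau-Levenshtein similarity and TF-IDF cosine similarity) could be combined under fixed weights. The weights, function names, and tokenization here are illustrative assumptions, not the paper's actual AHP-derived values or implementation.

```python
# Illustrative sketch: weighted composite of two text-similarity metrics.
# Weights below are placeholders; the paper derives them via ranking
# alignment and pairwise AHP analysis.
import math
from collections import Counter

def damerau_levenshtein(a: str, b: str) -> int:
    # Restricted (optimal string alignment) Damerau-Levenshtein distance:
    # insertions, deletions, substitutions, and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    # TF-IDF vectors over the two-document "corpus", then cosine similarity.
    toks_a, toks_b = doc_a.lower().split(), doc_b.lower().split()
    tf_a, tf_b = Counter(toks_a), Counter(toks_b)
    vocab = set(tf_a) | set(tf_b)
    def idf(t):  # smoothed idf over the 2-document corpus
        df = (t in tf_a) + (t in tf_b)
        return math.log((2 + 1) / (df + 1)) + 1
    va = {t: tf_a[t] * idf(t) for t in vocab}
    vb = {t: tf_b[t] * idf(t) for t in vocab}
    dot = sum(va[t] * vb[t] for t in vocab)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def composite_score(candidate: str, reference: str,
                    weights=(0.5, 0.5)) -> float:
    # Map edit distance into a [0, 1] similarity, then take a weighted sum.
    denom = max(len(candidate), len(reference), 1)
    dl_sim = 1 - damerau_levenshtein(candidate, reference) / denom
    return weights[0] * dl_sim + weights[1] * tfidf_cosine(candidate, reference)
```

A full implementation would add BLEU and Universal Sentence Encoder similarity as two further weighted terms; the normalization step (mapping edit distance into [0, 1]) is what lets heterogeneous metrics be combined on one scale.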