One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset of 196 real-world ethical dilemmas with expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. For comparison, we also collect non-expert human responses; because these responses are brief, the comparison is limited to the Key Factors section. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, DeepSeek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs against expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and with proposing nuanced resolution strategies, both of which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting reliance on intuitive moral reasoning. These findings highlight both the strengths and the current limitations of LLMs in ethical decision-making.
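To make the composite-metric idea concrete, the following is a minimal sketch of how two of the four metrics (normalized Damerau-Levenshtein similarity and TF-IDF cosine similarity) could be combined under fixed weights. The weights, function names, and tokenization here are illustrative assumptions, not the paper's actual AHP-derived values or implementation.

```python
# Illustrative sketch: weighted composite of two text-similarity metrics.
# Weights below are placeholders; the paper derives them via ranking
# alignment and pairwise AHP analysis.
import math
from collections import Counter

def damerau_levenshtein(a: str, b: str) -> int:
    # Restricted (optimal string alignment) Damerau-Levenshtein distance:
    # insertions, deletions, substitutions, and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    # TF-IDF vectors over the two-document "corpus", then cosine similarity.
    toks_a, toks_b = doc_a.lower().split(), doc_b.lower().split()
    tf_a, tf_b = Counter(toks_a), Counter(toks_b)
    vocab = set(tf_a) | set(tf_b)
    def idf(t):  # smoothed idf over the 2-document corpus
        df = (t in tf_a) + (t in tf_b)
        return math.log((2 + 1) / (df + 1)) + 1
    va = {t: tf_a[t] * idf(t) for t in vocab}
    vb = {t: tf_b[t] * idf(t) for t in vocab}
    dot = sum(va[t] * vb[t] for t in vocab)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def composite_score(candidate: str, reference: str,
                    weights=(0.5, 0.5)) -> float:
    # Map edit distance into a [0, 1] similarity, then take a weighted sum.
    denom = max(len(candidate), len(reference), 1)
    dl_sim = 1 - damerau_levenshtein(candidate, reference) / denom
    return weights[0] * dl_sim + weights[1] * tfidf_cosine(candidate, reference)
```

A full implementation would add BLEU and Universal Sentence Encoder similarity as two further weighted terms; the normalization step (mapping edit distance into [0, 1]) is what lets heterogeneous metrics be combined on one scale.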