The rapid advancement of Large Language Models (LLMs) and their potential integration into autonomous driving systems necessitates understanding their moral decision-making capabilities. While our previous study examined four prominent LLMs using the Moral Machine experimental framework, the dynamic landscape of LLM development demands a more comprehensive analysis. Here, we evaluate moral judgments across 51 different LLMs, including multiple versions of proprietary models (GPT, Claude, Gemini) and open-source alternatives (Llama, Gemma), to assess their alignment with human moral preferences in autonomous driving scenarios. Using a conjoint analysis framework, we evaluated how closely LLM responses aligned with human preferences in ethical dilemmas and examined the effects of model size, updates, and architecture. Results showed that proprietary models and open-source models exceeding 10 billion parameters demonstrated relatively close alignment with human judgments, with a significant negative correlation between model size and distance from human judgments in open-source models. However, model updates did not consistently improve alignment with human preferences, and many LLMs showed excessive emphasis on specific ethical principles. These findings suggest that while increasing model size may naturally lead to more human-like moral judgments, practical implementation in autonomous driving systems requires careful consideration of the trade-off between judgment quality and computational efficiency. Our comprehensive analysis provides crucial insights for the ethical design of autonomous systems and highlights the importance of considering cultural contexts in AI moral decision-making.