As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes, including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results show that only 2 of the 6 models (33%) achieve assessment clearance, and that the local Llama 3.2 model is particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.
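To make the shape of such a pipeline concrete, the following is a minimal sketch of the encode, perturb, validate, and measure loop. Only the 22-dimensional space comes from the abstract. The `[0, 1]` coordinate range, the `shift_dimension` perturbation, the `is_valid` check, the `toy_model` decision rule, and the `flip_rate` metric are all hypothetical stand-ins chosen for illustration; in particular, `flip_rate` is a single-number proxy and not the 4-component EII.

```python
from typing import Callable, List, Sequence

ECS_DIM = 22  # dimensionality of the Ethical Consequence Space, per the abstract

Vector = List[float]
Perturbation = Callable[[Sequence[float]], Vector]


def shift_dimension(index: int, delta: float) -> Perturbation:
    """Hypothetical perturbation: add `delta` to one ECS coordinate."""
    def perturb(x: Sequence[float]) -> Vector:
        y = list(x)
        y[index] += delta
        return y
    return perturb


def is_valid(original: Sequence[float], perturbed: Sequence[float],
             max_shift: float = 0.3) -> bool:
    """Assumed validity check: stay in [0, 1] and within max_shift of the original."""
    in_bounds = all(0.0 <= v <= 1.0 for v in perturbed)
    close = all(abs(a - b) <= max_shift for a, b in zip(original, perturbed))
    return len(perturbed) == ECS_DIM and in_bounds and close


def toy_model(x: Sequence[float]) -> int:
    """Placeholder decision model: returns 1 when the mean coordinate exceeds 0.5."""
    return 1 if sum(x) / len(x) > 0.5 else 0


def flip_rate(model: Callable[[Sequence[float]], int],
              scenario: Sequence[float],
              perturbations: Sequence[Perturbation]) -> float:
    """Fraction of valid perturbed scenarios on which the decision changes."""
    baseline = model(scenario)
    candidates = [p(scenario) for p in perturbations]
    valid = [c for c in candidates if is_valid(scenario, c)]
    if not valid:
        return 0.0
    flips = sum(1 for c in valid if model(c) != baseline)
    return flips / len(valid)


# Illustrative scenario: mean slightly above 0.5, so the baseline decision is 1.
scenario = [0.5] * ECS_DIM
scenario[0] = 0.6

perturbations = [
    shift_dimension(0, -0.2),  # valid; pushes the mean below 0.5, so the decision flips
    shift_dimension(1, 0.1),   # valid; decision unchanged
    shift_dimension(2, 0.9),   # rejected: leaves [0, 1] and exceeds max_shift
]

rate = flip_rate(toy_model, scenario, perturbations)
print(rate)
```

Filtering candidates through the validity check before scoring keeps rejected perturbations out of the denominator, so the metric reflects only perturbations that a constraint layer would admit.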