Existing AI moral-evaluation frameworks test for the production of correct-sounding ethical responses rather than for genuine moral reasoning capacity. This paper introduces a novel probe methodology that uses literary narrative, specifically unresolvable moral scenarios drawn from a published science fiction series, as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n = 7) and Series 2 (local and API open-source systems, blind and declared; n = 6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions); the blind-versus-declared comparison yielded zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling-probe judges (rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified, including categorical self-misidentification and false-positive self-attribution, suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument, one that becomes more discriminating as AI capability increases, and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.
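The inter-judge agreement statistic reported above (rs = 1.00) is Spearman's rank correlation; identical rank orderings from the two judges yield exactly 1.0 regardless of the judges' raw score scales. A minimal sketch of that computation, using hypothetical placeholder scores (not the study's data) and assuming no tied scores:

```python
# Sketch: Spearman rank-order agreement between two independent judges.
# All scores below are illustrative placeholders, not the study's data.

def spearman_rho(xs, ys):
    """Spearman rank correlation for two equal-length score lists (no ties)."""
    def ranks(vals):
        # Rank 1 = lowest score; no tie handling needed for this sketch.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

judge_a = [4.0, 3.2, 2.8, 1.5]  # hypothetical ceiling-probe scores, judge 1
judge_b = [9.1, 7.6, 6.0, 3.3]  # hypothetical scores, judge 2 (same ordering)
print(spearman_rho(judge_a, judge_b))  # identical rankings -> 1.0
```

Because the statistic depends only on rank order, the two judges' different scoring scales are irrelevant; a fully reversed ordering would instead give rs = -1.0.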