Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Sanjay Basu

From arXiv; 20 pages, 5 figures. Code: https://github.com/sanjaybasu/compositional-depth-clinical-ehr

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy (the number of distinct reasoning steps required to answer a clinical question from an EHR) as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and in cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show a monotone decline in accuracy with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows that higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline is better explained by compositional reasoning difficulty than by missing context. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.
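
For readers unfamiliar with the trend statistic quoted in the abstract, the following is a minimal sketch of a Cochran-Armitage test for a linear trend in proportions across ordered groups (here, hop levels 1 to 4). It is not the author's analysis code. The function names and the per-hop counts are hypothetical and chosen only for illustration; they are not the study's data. The sketch assumes equally spaced integer scores and a one-sided (decreasing) alternative. That alternative is an assumption, though it agrees with the reported pair z=-2.30, p=0.011, since the lower normal tail at -2.30 is about 0.011.

```python
import math


def normal_lower_tail(z):
    """P(Z <= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def cochran_armitage(correct, totals, scores):
    """Cochran-Armitage trend test for proportions across ordered groups.

    correct[i] / totals[i] is the accuracy in group i, and scores[i] is the
    ordinal score for that group (e.g. the hop count). Returns (z, p), where
    p is the one-sided p-value for a DECREASING trend.
    """
    n_all = sum(totals)
    p_bar = sum(correct) / n_all  # pooled accuracy under the null of no trend

    # Score-weighted deviation of observed from expected correct answers.
    t_stat = sum(s * (c - n * p_bar) for s, c, n in zip(scores, correct, totals))

    # Variance of the statistic under the null hypothesis.
    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_all)

    z = t_stat / math.sqrt(variance)
    return z, normal_lower_tail(z)


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only (not taken from the paper).
    hops = [1, 2, 3, 4]
    totals = [100, 100, 100, 100]
    correct = [40, 30, 25, 15]
    z, p = cochran_armitage(correct, totals, hops)
    print(f"z = {z:.2f}, one-sided p = {p:.4g}")
```

A negative z indicates that accuracy falls as the hop score rises. The per-hop odds ratios in the abstract would come from a separate logistic regression of correctness on hop count, which this sketch does not attempt.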