Beyond P-Values: Importing Quantitative Finance's Risk and Regret Metrics for AI in Learning Health Systems

The increasing deployment of artificial intelligence (AI) in clinical settings challenges foundational assumptions underlying traditional frameworks of medical evidence. Classical statistical approaches, centered on randomized controlled trials, frequentist hypothesis testing, and static confidence intervals, were designed for fixed interventions evaluated under stable conditions. In contrast, AI-driven clinical systems learn continuously, adapt their behavior over time, and operate in non-stationary environments shaped by evolving populations, practices, and feedback effects. In such systems, clinical harm arises less from average error rates than from calibration drift, rare but severe failures, and the accumulation of suboptimal decisions over time. In this perspective, we argue that prevailing notions of statistical significance are insufficient for characterizing evidence and safety in learning health systems. Drawing on risk-theoretic concepts from quantitative finance and online decision theory, we propose reframing medical evidence for adaptive AI systems in terms of time-indexed calibration stability, bounded downside risk, and controlled cumulative regret. We emphasize that this approach does not replace randomized trials or causal inference, but complements them by addressing dimensions of risk and uncertainty that emerge only after deployment. This framework provides a principled mathematical language for evaluating AI-driven clinical systems under continual learning and offers implications for clinical practice, research design, and regulatory oversight.

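To make the three proposed quantities concrete, the following is a minimal sketch under stated assumptions rather than a definitive implementation. It assumes that time-indexed calibration stability can be summarized by a binned expected calibration error computed per deployment window, that bounded downside risk can be summarized by conditional value-at-risk (expected shortfall, borrowed from finance) over per-decision losses, and that cumulative regret is measured against a comparator policy chosen in hindsight. All function names, the loss scale, and the window size are hypothetical illustrations and are not taken from the source.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean gap between predicted risk and observed event rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            event_rate = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(mean_p - event_rate)
    return ece


def windowed_ece(probs, labels, window):
    """ECE per consecutive, non-overlapping time window (assumed drift summary)."""
    return [
        expected_calibration_error(probs[i:i + window], labels[i:i + window])
        for i in range(0, len(probs), window)
    ]


def cvar(losses, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    # round() guards against floating-point noise in (1 - alpha) * n
    k = max(1, math.ceil(round((1 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def cumulative_regret(incurred, comparator):
    """Running sum of (incurred loss - comparator loss) at each decision."""
    total, out = 0, []
    for a, b in zip(incurred, comparator):
        total += a - b
        out.append(total)
    return out
```

In this sketch, a rising sequence from `windowed_ece` would indicate calibration drift across deployment periods, `cvar` would flag rare but severe failures that an average error rate hides, and a `cumulative_regret` curve that grows linearly rather than flattening would suggest that suboptimal decisions are accumulating over time.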