The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, whereas clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence), a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it for mental health as the SCOPE-MH reporting standard. We operationalize SCOPE-MH in a proof-of-concept study on AnnoMI, a dataset of expert-annotated motivational interviewing conversations, which reveals failure mechanisms that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for deploying mental health AI in safety-critical settings.