Evidence Sufficiency Under Delayed Ground Truth: Proxy Monitoring for Risk Decision Systems

From arXiv: 25 pages, 11 tables, 56 references. Software: Evidence Sufficiency Calculator (doi:10.5281/zenodo.19233931) and Governance Drift Toolkit (doi:10.5281/zenodo.19236418).

Machine learning systems in fraud detection, credit scoring, and clinical risk assessment operate under delayed ground truth: outcome labels arrive days to months after the decisions they evaluate. During this blind period, governance evidence degrades through mechanisms that neither drift detection methods nor governance frameworks adequately address. This paper formalizes an evidence sufficiency model with four dimensions (completeness, freshness, reliability, representativeness) and a decision-readiness gate that quantifies how label latency degrades evidence quality. The model maps three drift types to dimension-specific degradation trajectories. A complementary proxy indicator framework comprising seven measurement categories estimates sufficiency degradation without labels, with explicit coverage mapping and characterized blind spots per drift type. Evaluation on the IEEE-CIS Fraud Detection dataset (~590K transactions) with controlled drift injection shows that composite proxy monitoring detects covariate and mixed drift with a 100% detection rate, while concept drift without feature change remains undetected, consistent with the theoretical impossibility of unsupervised detection when P(X) is unchanged. Blind-period simulation confirms monotone sufficiency degradation, with concept drift degrading fastest (S=0.242 at day 60 vs. 0.418 for no drift). The framework contributes a governance sufficiency monitoring instrument; its value lies in translating drift signals into auditable sufficiency assessments with characterized blind spots. Mapping sufficiency levels to governance actions requires deployment-specific calibration beyond this study's scope.
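
The blind spot described in the abstract can be illustrated with a minimal sketch. It assumes a single synthetic Gaussian feature, a Population Stability Index (PSI) as a stand-in for one label-free proxy indicator, and the common rule-of-thumb alert threshold of 0.2. None of these choices come from the paper, and the `psi`, `model`, and `truth_*` helpers are hypothetical names, not part of the Evidence Sufficiency Calculator or the Governance Drift Toolkit.

```python
import math
import random


def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples (illustrative helper)."""
    # Bin edges are the reference sample's quantiles (equal-frequency bins).
    ref_sorted = sorted(reference)
    edges = [ref_sorted[len(ref_sorted) * k // bins] for k in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # index of the bin holding x
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = fractions(reference), fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


PSI_ALERT = 0.2  # assumed rule-of-thumb threshold, not taken from the paper

rng = random.Random(0)
n = 5000
reference = [rng.gauss(0.0, 1.0) for _ in range(n)]         # training-time P(X)
covariate_window = [rng.gauss(1.0, 1.0) for _ in range(n)]  # P(X) shifts
concept_window = [rng.gauss(0.0, 1.0) for _ in range(n)]    # P(X) unchanged


def model(x):
    """Frozen toy classifier: flags large feature values."""
    return x > 1.5


def truth_before(x):
    """Label rule the toy model was fit to."""
    return x > 1.5


def truth_after(x):
    """Label rule after concept drift: P(Y|X) changes, P(X) does not."""
    return x < -1.5


def error_rate(sample, truth):
    return sum(model(x) != truth(x) for x in sample) / len(sample)


psi_covariate = psi(reference, covariate_window)
psi_concept = psi(reference, concept_window)
error_reference = error_rate(reference, truth_before)
error_concept = error_rate(concept_window, truth_after)

print(f"covariate drift: PSI={psi_covariate:.3f}, alert={psi_covariate > PSI_ALERT}")
print(f"concept drift:   PSI={psi_concept:.3f}, alert={psi_concept > PSI_ALERT}")
print(f"error rate before={error_reference:.3f}, after concept drift={error_concept:.3f}")
```

In this toy setup the feature-based proxy raises an alert under covariate drift but stays quiet under pure concept drift, even though the frozen model's error rate rises once the delayed labels arrive. Any indicator computed only from X behaves the same way when P(X) is unchanged, which is why the abstract treats this case as a characterized blind spot rather than a detector failure.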
