The Stationarity Bias: Stratified Stress-Testing for Time-Series Imputation in Regulated Dynamical Systems

Time-series imputation benchmarks typically employ uniform random masking and shape-agnostic metrics (MSE, RMSE), which implicitly weights evaluation by regime prevalence. In systems with a dominant attractor -- homeostatic physiology, nominal industrial operation, stable network traffic -- this creates a systematic \emph{Stationarity Bias}: simple methods appear superior because the benchmark predominantly samples the easy, low-entropy regime where they trivially succeed. We formalize this bias and propose a \emph{Stratified Stress-Test} that partitions evaluation into Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) as a testbed -- chosen because its recorded forcing functions (meals, insulin) provide ground truth for precise regime identification -- we establish three findings with broad implications: (i)~Stationary Efficiency: Linear interpolation achieves state-of-the-art reconstruction during stable intervals, indicating that complex architectures are computationally wasteful in low-entropy regimes. (ii)~Transient Fidelity: During critical transients (post-prandial peaks, hypoglycemic events), linear methods lose morphological fidelity, measured by Dynamic Time Warping (DTW) distance, to a degree disproportionate to their RMSE -- a phenomenon we term the \emph{RMSE Mirage}, where low pointwise error masks the destruction of signal shape. (iii)~Regime-Conditional Model Selection: Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical downstream tasks. We further derive empirical missingness distributions from clinical trials and impose them on complete training data, which prevents models from exploiting unrealistically clean observations and encourages robustness under real-world missingness. This framework generalizes to any regulated system in which routine stationary operation far outweighs critical transients in prevalence.
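
To make the stratified evaluation concrete, the following is a minimal Python sketch, not the authors' implementation. It builds a synthetic glucose-like trace (a flat baseline with one Gaussian "meal" excursion), masks several gaps, fills them by linear interpolation, and reports RMSE and DTW separately for the Stationary and Transient regimes alongside the pooled RMSE. The trace, the gap positions, the hand-labelled transient window, and all function names are assumptions made for illustration; in the paper, regimes are identified from recorded meal and insulin events.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def linear_impute(series, mask):
    """Fill masked samples by linear interpolation between observed ones."""
    t = np.arange(len(series))
    filled = series.copy()
    filled[mask] = np.interp(t[mask], t[~mask], series[~mask])
    return filled

def stratified_scores(truth, imputed, mask, transient):
    """Score masked samples separately in each regime.

    For brevity, masked samples within a regime are concatenated before
    computing DTW; scoring each gap on its own would be a finer choice.
    """
    out = {}
    for name, regime in (("stationary", ~transient), ("transient", transient)):
        idx = mask & regime
        out[name] = {
            "n": int(idx.sum()),
            "rmse": rmse(truth[idx], imputed[idx]),
            "dtw": dtw_distance(truth[idx], imputed[idx]),
        }
    return out

# Synthetic trace (assumption): baseline of 100 with one excursion at t = 150.
t = np.arange(300)
truth = 100.0 + 80.0 * np.exp(-(((t - 150) / 15.0) ** 2))

# Hand-labelled transient window (assumption).
transient = np.zeros(len(t), dtype=bool)
transient[120:180] = True

# Four gaps in the stationary regime, one gap across the peak (assumption).
mask = np.zeros(len(t), dtype=bool)
for start in (20, 60, 140, 200, 240):
    mask[start:start + 20] = True

imputed = linear_impute(truth, mask)
scores = stratified_scores(truth, imputed, mask, transient)
pooled = rmse(truth[mask], imputed[mask])

print("pooled RMSE:", round(pooled, 3))
for name, s in scores.items():
    print(name, s)
```

Because four of the five gaps fall in the flat regime, the pooled RMSE is pulled well below the transient RMSE, which is the prevalence weighting that the stratified report is meant to expose.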
