Learning-to-defer (L2D) routes each decision either to a system's own predictor or to an external expert. Streaming time-series settings break the assumptions of offline L2D: the data are non-stationary, expert availability shifts over time, and the internal predictor is itself trained online. We propose L2D-SLDS, a one-stage online L2D framework based on a factorized switching linear-Gaussian state-space model over all potential residuals, with three latent components: a discrete regime, a shared global factor, and per-expert idiosyncratic states. Because the internal residual is always observed, it continuously updates beliefs about every unqueried expert through the shared factor, and a learner-aware query score balances immediate cost against latent-state information gain and one-step learner improvement. We prove an oracle inequality against a time-varying learn-and-defer comparator, decomposing regret into a query-bonus budget, an SLDS predictive-cost-error term~$\mathcal{E}_{\mathrm{SLDS}}$, and the internal learner's interval dynamic regret. On synthetic, Melbourne, Jena, and 24-expert Delhi benchmarks, L2D-SLDS is competitive with or improves on contextual- and non-stationary-bandit baselines while deferring on ${<}2\%$ of real-data rounds.
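To illustrate how an always-observed internal residual can update beliefs about an unqueried expert through a shared factor, the following is a minimal sketch and not the L2D-SLDS implementation. It assumes a single fixed regime (no switching, no time dynamics), a latent state $z = [g, u_0, u_1]$ with one shared factor $g$ and one idiosyncratic state per predictor, and a residual model $r_k = b_k g + u_k + \text{noise}$. The loadings `b`, the unit prior covariance, and the noise variance are hypothetical values chosen only for illustration.

```python
import numpy as np

def kalman_update(mean, cov, H, R, y):
    """Standard linear-Gaussian measurement update for y = H z + e, e ~ N(0, R)."""
    S = H @ cov @ H.T + R                 # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)      # Kalman gain
    new_mean = mean + K @ (y - H @ mean)
    new_cov = cov - K @ H @ cov
    return new_mean, new_cov

# Hypothetical loadings on the shared factor g:
# index 0 = internal predictor, index 1 = one external expert.
b = np.array([1.0, 0.8])

# Latent state z = [g, u_0, u_1] with an assumed standard-normal prior.
mean = np.zeros(3)
cov = np.eye(3)

# Row k maps the latent state to the residual of predictor k.
H_all = np.array([
    [b[0], 1.0, 0.0],
    [b[1], 0.0, 1.0],
])
R_obs = np.array([[0.1]])  # assumed observation-noise variance

def expert_belief(mean, cov):
    """Predictive mean and variance of the (noise-free) expert residual."""
    h = H_all[1]
    return float(h @ mean), float(h @ cov @ h)

prior_mean_1, prior_var_1 = expert_belief(mean, cov)

# Observe only the internal residual r_0 = 1.0; the expert is not queried.
post_mean, post_cov = kalman_update(mean, cov, H_all[:1], R_obs, np.array([1.0]))
post_mean_1, post_var_1 = expert_belief(post_mean, post_cov)

print(f"expert residual belief before: mean={prior_mean_1:.3f}, var={prior_var_1:.3f}")
print(f"expert residual belief after:  mean={post_mean_1:.3f}, var={post_var_1:.3f}")
```

In this sketch the expert's predictive mean moves away from zero and its variance shrinks even though the expert was never queried, because both residuals load on the same factor $g$. The expert's idiosyncratic state $u_1$ is untouched by the update, which is the part that only a query could inform.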