Sequence-to-sequence architectures built upon recurrent neural networks have become a standard choice for multi-step-ahead time series prediction. In these models, the decoder produces future values conditioned on contextual inputs, typically either actual historical observations (ground truth) or previously generated predictions. During training, feeding ground-truth values helps stabilize learning but creates a mismatch between training and inference conditions, known as exposure bias, since such true values are inaccessible during real-world deployment. On the other hand, using the model's own outputs as inputs at test time often causes errors to compound rapidly across prediction steps. To mitigate these limitations, we introduce a new training paradigm grounded in reinforcement learning: a policy gradient-based method to learn an adaptive input selection strategy for sequence-to-sequence prediction models. Auxiliary models first synthesize plausible input candidates for the decoder, and a trainable policy network optimized via policy gradients dynamically chooses the most beneficial inputs to maximize long-term prediction performance. Empirical evaluations on diverse time series datasets confirm that our approach enhances both accuracy and stability in multi-step forecasting compared to conventional methods.
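The core mechanic described above, a policy trained with policy gradients to decide which input the decoder receives at each step, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hypothetical linear one-step predictor on a synthetic sine series, reduces the candidate set to just two options (ground truth vs. the model's own prediction, rather than auxiliary-model candidates), and trains a per-step Bernoulli selection policy with a REINFORCE-style update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series: a noisy sine wave; we forecast H steps ahead.
T, H = 200, 5
series = np.sin(np.linspace(0, 8 * np.pi, T)) + 0.05 * rng.standard_normal(T)

# Hypothetical one-step "decoder": a linear AR(1) predictor fit by least squares.
X, y = series[:-1], series[1:]
w = (X @ y) / (X @ X)  # slope of y ≈ w * x

def rollout(start, choose_truth):
    """Decode H steps; choose_truth[h] == 1 feeds ground truth as the
    next decoder input, otherwise the model's own prediction is fed back."""
    x = series[start]
    preds = []
    for h in range(H):
        x = w * x                      # one-step prediction
        preds.append(x)
        if choose_truth[h]:            # the policy's action at step h
            x = series[start + 1 + h]  # replace input with ground truth
    return np.array(preds)

# REINFORCE over a Bernoulli policy: theta[h] is the logit of
# feeding ground truth at decode step h.
theta = np.zeros(H)
lr, baseline = 0.1, 0.0
for _ in range(500):
    start = rng.integers(0, T - H - 1)
    p = 1.0 / (1.0 + np.exp(-theta))        # P(choose ground truth) per step
    a = (rng.random(H) < p).astype(float)   # sample input-selection actions
    preds = rollout(start, a)
    target = series[start + 1 : start + 1 + H]
    reward = -np.mean((preds - target) ** 2)  # reward = negative horizon MSE
    baseline = 0.95 * baseline + 0.05 * reward  # moving-average baseline
    theta += lr * (reward - baseline) * (a - p)  # policy-gradient step

print("P(feed ground truth) per step:", p.round(2))
```

Note that with training error as the reward, such a policy tends to favor ground-truth inputs, which is exactly the exposure-bias trap the abstract describes; the full method avoids this by drawing candidates from auxiliary models and rewarding long-term prediction performance rather than teacher-forced error.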