We study the performance of transformer architectures for multivariate time-series forecasting in low-data regimes consisting of only a few years of daily observations. Using synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios, we conduct bootstrapped experiments that enable direct evaluation via out-of-sample correlations with the optimal ground-truth predictor. We show that two-way attention transformers, which alternate between temporal and cross-sectional self-attention, can outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. We further introduce a dynamic sparsification procedure for attention matrices applied during training, and demonstrate that it is particularly effective in noisy environments, where the correlation between the target variable and the optimal predictor is on the order of a few percent. Analysis of the learned attention patterns reveals interpretable structure and suggests connections to sparsity-inducing regularization in classical regression, providing insight into why these models generalize effectively under noise.
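As a minimal sketch of the alternation between temporal and cross-sectional self-attention described above, assuming a PyTorch implementation, one block might look like the following. The names (TwoWayBlock, sparsify_attention, d_model, n_heads) and the normalization choices are hypothetical, not taken from the paper; the sparsification helper shows one generic row-wise top-k form, since the abstract does not specify the paper's actual dynamic procedure.

```python
import torch
import torch.nn as nn


class TwoWayBlock(nn.Module):
    """One two-way attention block: temporal, then cross-sectional self-attention.

    Hypothetical sketch; input shape is (batch, n_series, n_steps, d_model).
    The residual/normalization layout is an assumption, not the paper's spec.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_c = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, t, d = x.shape
        # Temporal attention: each series attends over its own time steps.
        h = x.reshape(b * n, t, d)
        h = self.norm_t(h + self.temporal_attn(h, h, h, need_weights=False)[0])
        h = h.reshape(b, n, t, d)
        # Cross-sectional attention: each time step attends across series.
        h = h.permute(0, 2, 1, 3).reshape(b * t, n, d)
        h = self.norm_c(h + self.cross_attn(h, h, h, need_weights=False)[0])
        return h.reshape(b, t, n, d).permute(0, 2, 1, 3)


def sparsify_attention(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Generic row-wise top-k sparsification of an attention matrix.

    Keeps the k largest weights per query row, zeroes the rest, and
    renormalizes. Illustrative only; the paper's procedure may differ.
    """
    idx = attn.topk(k, dim=-1).indices
    mask = torch.zeros_like(attn).scatter_(-1, idx, 1.0)
    sparse = attn * mask
    return sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```

In a sketch like this, "dynamic" sparsification would plausibly mean recomputing the mask from the live attention weights at every forward pass during training, rather than fixing a sparsity pattern in advance.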