Real-world multivariate time series exhibit intricate multi-scale structure, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that struggle to capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. Each MoHE layer routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert that preserves sequence-level continuity with routed Fourier-based experts that capture patch-level periodic structure. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. Across seven multivariate benchmarks and multiple horizons, MoHETS consistently achieves state-of-the-art performance, reducing average MSE by $12\%$ relative to strong recent baselines and demonstrating the effectiveness of heterogeneous expert specialization for long-term forecasting.
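To make the MoHE layer concrete, the following PyTorch sketch shows one plausible reading of the routing scheme: an always-active shared depthwise-convolution expert over the patch axis, plus top-k-routed experts that apply a learnable spectral filter to each patch embedding. All names (`FourierExpert`, `MoHELayer`) and hyperparameters (`n_experts`, `top_k`, kernel size) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FourierExpert(nn.Module):
    """Hypothetical routed expert: filters each patch embedding in the frequency domain."""
    def __init__(self, d_model: int):
        super().__init__()
        # Complex-valued pointwise weights over the rFFT bins of the embedding dimension.
        self.weight = nn.Parameter(torch.randn(d_model // 2 + 1, dtype=torch.cfloat) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        spec = torch.fft.rfft(x, dim=-1)                  # per-patch spectrum
        spec = spec * self.weight                         # learnable spectral filter
        return torch.fft.irfft(spec, n=x.size(-1), dim=-1)


class MoHELayer(nn.Module):
    """Shared depthwise-conv expert plus sparsely routed Fourier experts (top-k gating)."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2, kernel: int = 3):
        super().__init__()
        # Shared expert: depthwise conv along the patch (sequence) axis for continuity.
        self.shared = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model)
        self.experts = nn.ModuleList(FourierExpert(d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, patches, d_model)
        out = self.shared(x.transpose(1, 2)).transpose(1, 2)   # sequence-level path
        gates = self.router(x)                                  # (batch, patches, n_experts)
        val, idx = gates.topk(self.top_k, dim=-1)               # select top-k experts per patch
        val = F.softmax(val, dim=-1)                            # normalize gate weights
        flat = x.reshape(-1, x.size(-1))
        flat_out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).reshape(-1, self.top_k)           # patches that chose expert e
            rows, slot = mask.nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            w = val.reshape(-1, self.top_k)[rows, slot].unsqueeze(-1)
            flat_out[rows] += w * expert(flat[rows])            # weighted expert output
        return out + flat_out.view_as(x)                        # combine both expert paths
```

Keeping the depthwise-convolution expert always active while sparsely activating the Fourier experts mirrors the abstract's division of labor between sequence-level continuity and patch-level periodicity; a faithful implementation would follow the paper's routing and load-balancing details.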