Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We evaluate MoHETS on seven multivariate benchmarks across multiple horizons; it consistently achieves state-of-the-art performance and reduces average MSE by $12\%$ relative to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.
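To make the MoHE routing concrete, the following is a minimal NumPy sketch of a heterogeneous expert layer: each temporal patch is scored by a linear router, dispatched to its top-$k$ Fourier-based experts (here, hypothetical low-pass filters with different frequency cutoffs), and the result is combined with a shared depthwise-convolution expert applied to every patch. All function names, the routing weights `router_w`, the cutoff list `keeps`, and the fixed smoothing kernel are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fourier_expert(patch, keep):
    # Hypothetical Fourier expert: keep the lowest `keep` frequency
    # bins of the patch and zero out the rest (a simple low-pass filter).
    spec = np.fft.rfft(patch)
    mask = np.zeros_like(spec)
    mask[:keep] = 1.0
    return np.fft.irfft(spec * mask, n=len(patch))

def depthwise_conv_expert(patch, kernel):
    # Shared expert: same-length 1-D convolution with edge padding,
    # modeling local continuity along the temporal axis.
    pad = len(kernel) // 2
    padded = np.pad(patch, pad, mode="edge")
    return np.convolve(padded, kernel, mode="valid")[: len(patch)]

def mohe_layer(patches, router_w, top_k=1, keeps=(2, 4, 8)):
    """Route each patch to its top-k Fourier experts and add the
    shared depthwise-convolution expert's output (illustrative only)."""
    out = np.zeros_like(patches)
    kernel = np.array([0.25, 0.5, 0.25])  # assumed fixed smoothing kernel
    for i, p in enumerate(patches):
        logits = router_w @ p                      # one score per expert
        top = np.argsort(logits)[-top_k:]          # indices of top-k experts
        gates = np.exp(logits[top])
        gates /= gates.sum()                       # softmax over selected experts
        routed = sum(g * fourier_expert(p, keeps[e]) for g, e in zip(gates, top))
        out[i] = routed + depthwise_conv_expert(p, kernel)
    return out

# Usage: 4 patches of length 16 routed among 3 Fourier experts.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 16))
router_w = rng.normal(size=(3, 16))
mixed = mohe_layer(patches, router_w)
```

Only the top-$k$ routed experts run per patch, which is what keeps the layer sparse; the shared convolution expert is dense by design, mirroring the paper's split between sequence-level continuity and patch-level periodic structure.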