Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLPs with these advanced architectures through knowledge distillation (KD). Our preliminary study reveals that different models capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to the MLP. We further provide a theoretical analysis showing that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets, while achieving up to 7× faster inference with 130× fewer parameters. Extensive evaluations further highlight the versatility and effectiveness of TimeDistill.
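To make the cross-architecture setup concrete, the following is a minimal sketch of prediction-level distillation from a pretrained teacher into a lightweight MLP student. It is not the TimeDistill method itself: the paper's multi-scale and multi-period matching losses are not reproduced here, and the names `MLPStudent`, `distill_loss`, and the weight `alpha` are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: batch B, look-back length L, forecast horizon H, channels C.
B, L, H, C = 32, 96, 24, 7

class MLPStudent(nn.Module):
    """Lightweight per-channel MLP mapping the look-back window to the horizon."""
    def __init__(self, lookback, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lookback, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon),
        )

    def forward(self, x):  # x: (B, L, C)
        # Apply the MLP along the time dimension for each channel.
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (B, H, C)

def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blend ground-truth supervision with imitation of the teacher's forecast."""
    sup = F.mse_loss(student_pred, target)        # standard forecasting loss
    kd = F.mse_loss(student_pred, teacher_pred)   # match the teacher's predictions
    return alpha * sup + (1 - alpha) * kd

# Toy usage: the teacher here is a stand-in; in practice it would be a
# pretrained Transformer or CNN forecaster held frozen during distillation.
teacher = MLPStudent(L, H)              # placeholder for a pretrained teacher
student = MLPStudent(L, H, hidden=64)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(B, L, C)                # dummy input series
y = torch.randn(B, H, C)                # dummy ground-truth horizon
with torch.no_grad():
    t_pred = teacher(x)
loss = distill_loss(student(x), t_pred, y)
loss.backward()
optimizer.step()
```

After training, only the small MLP student is deployed, which is where the inference-speed and parameter-count gains reported in the abstract come from.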