Deep learning models, particularly recurrent neural networks (RNNs) and their variants such as long short-term memory (LSTM) networks, have significantly advanced time series analysis. These models capture complex sequential patterns in time series, enabling real-time assessment. However, their high computational complexity and large model sizes hinder deployment in resource-constrained environments such as wearable devices and edge computing platforms. Knowledge Distillation (KD) offers a solution by transferring knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student), retaining high performance while reducing computational demands. However, current KD methods, originally designed for computer vision tasks, neglect the temporal dependencies and memory retention characteristics unique to time series models. To address this, we propose a novel KD framework termed Memory-Discrepancy Knowledge Distillation (MemKD). MemKD leverages a specialized loss function to capture memory retention discrepancies between the teacher and student models across subsequences of the time series, ensuring that the student model effectively mimics the teacher model's behaviour. This approach enables the development of compact, high-performing recurrent neural networks suited to real-time time series analysis. Extensive experiments demonstrate that MemKD significantly outperforms state-of-the-art KD methods, reducing parameter count and memory usage by roughly 500 times while maintaining performance comparable to the teacher model.
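To make the idea of a subsequence-level memory-discrepancy loss concrete, the following is a minimal PyTorch sketch. The abstract does not specify the exact formulation, so the choices here are illustrative assumptions: the series is split into fixed-size subsequences, "memory retention" is proxied by the change in the RNN output state across each subsequence, the discrepancy is penalized with an MSE term, and the teacher and student hidden sizes are assumed equal (otherwise a projection layer would be needed). The function name `memory_discrepancy_loss` and the argument `num_subseq` are hypothetical.

```python
# Illustrative sketch of a subsequence-level memory-discrepancy distillation loss.
# Not the authors' implementation; all design choices below are assumptions.
import torch
import torch.nn.functional as F


def memory_discrepancy_loss(teacher_rnn, student_rnn, x, num_subseq=4):
    """x: (batch, seq_len, features); both RNNs are batch_first GRU/LSTM modules."""
    chunks = torch.chunk(x, num_subseq, dim=1)   # split the series into subsequences
    t_state, s_state = None, None                # hidden states carried across subsequences
    loss = x.new_zeros(())
    for chunk in chunks:
        with torch.no_grad():                    # teacher is frozen during distillation
            t_out, t_state = teacher_rnn(chunk, t_state)
        s_out, s_state = student_rnn(chunk, s_state)
        # Proxy for memory retention: how much the output state drifts over the subsequence.
        t_retention = t_out[:, -1] - t_out[:, 0]
        s_retention = s_out[:, -1] - s_out[:, 0]
        # Penalize the teacher-student discrepancy in retention (equal hidden sizes assumed).
        loss = loss + F.mse_loss(s_retention, t_retention)
    return loss / num_subseq
```

In practice such a term would be combined with the task loss (and possibly a standard logit-matching KD term) via a weighting coefficient, e.g. `total = task_loss + lambda_mem * memory_discrepancy_loss(teacher, student, x)`.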