Recurrent Neural Networks (RNNs) are vital for sequential data processing. Long Short-Term Memory Autoencoders (LSTM-AEs) are particularly effective for unsupervised anomaly detection in time-series data. However, their inherent sequential dependencies limit parallel computation. While previous work has explored FPGA-based acceleration for LSTM networks, those efforts have typically focused on optimizing a single LSTM layer at a time. We introduce a novel FPGA-based accelerator built on a dataflow architecture that exploits temporal parallelism, processing different timesteps of a sequence concurrently across multiple layers. Experimental evaluations on four representative LSTM-AE models of varying width and depth, implemented on a Zynq UltraScale+ MPSoC FPGA, demonstrate significant advantages over CPU (Intel Xeon Gold 5218R) and GPU (NVIDIA V100) implementations. Our accelerator achieves latency speedups of up to 79.6x vs. CPU and 18.2x vs. GPU, alongside energy-per-timestep reductions of up to 1722x vs. CPU and 59.3x vs. GPU. Together with superior scalability in network depth, these results highlight the potential of our approach for high-performance, real-time, power-efficient LSTM-AE-based anomaly detection on FPGAs.
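The temporal parallelism described above can be pictured as a layer-level pipeline: while layer l works on timestep t, layer l+1 can already consume timestep t-1, so up to one timestep per layer is in flight at once. The following minimal Python sketch prints such a schedule; the layer count, sequence length, and all names are hypothetical illustrations, not values from the paper, and the real design realizes this schedule in hardware rather than software.

```python
# Software model of a temporal-parallel dataflow schedule:
# one processing element per LSTM layer, each one timestep
# behind the layer that feeds it.

NUM_LAYERS = 4   # hypothetical LSTM-AE depth (encoder + decoder layers)
SEQ_LEN = 6      # hypothetical number of timesteps per sequence

# At pipeline step s, layer l works on timestep (s - l): once layer 0
# finishes timestep t, layer 1 consumes it while layer 0 starts t + 1.
for step in range(SEQ_LEN + NUM_LAYERS - 1):
    active = [(l, step - l) for l in range(NUM_LAYERS)
              if 0 <= step - l < SEQ_LEN]
    print(f"step {step}: " +
          ", ".join(f"layer {l} -> timestep {t}" for l, t in active))
```

Note that each layer still honors its own recurrence by visiting timesteps in order; the concurrency comes from overlapping different layers on different timesteps, which is how a dataflow design can hide the sequential dependency that otherwise serializes multi-layer LSTM execution.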