We propose a distributed system based on low-power embedded FPGAs for edge computing applications, designed to explore distributed scheduling optimizations for Deep Learning (DL) workloads and to obtain the best performance in terms of latency and power efficiency. The cluster remained modular throughout the experiments: our implementations consist of up to 12 Zynq-7020 chip-based boards as well as 5 UltraScale+ MPSoC FPGA boards, connected through an Ethernet switch, and the cluster is used to evaluate the Versatile Tensor Accelerator (VTA), a configurable Deep Learning Accelerator (DLA). This adaptable distributed architecture is distinguished by its capacity to evaluate and manage neural network workloads in numerous configurations, which enables users to conduct multiple experiments tailored to their specific application needs. The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the computation graph in a pipeline structure, and manually allocate more resources to the most computationally intensive layers of the NN graph.
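The idea of giving more boards to the most computationally intensive pipeline stages can be illustrated with a minimal sketch. This is not the authors' scheduler: the function names, the per-stage cost values, and the assumption that replicating a stage across boards divides its latency evenly are all hypothetical and used only for illustration.

```python
def allocate_boards(stage_costs, num_boards):
    """Greedily assign boards to pipeline stages (hypothetical heuristic).

    stage_costs: estimated latency of each stage on a single board
                 (illustrative values, not measurements).
    num_boards:  total boards available in the cluster.
    """
    if num_boards < len(stage_costs):
        raise ValueError("need at least one board per pipeline stage")

    # Every stage starts with one board.
    boards = [1] * len(stage_costs)

    # Hand each remaining board to the current bottleneck stage,
    # i.e. the stage with the highest cost per board.
    for _ in range(num_boards - len(stage_costs)):
        bottleneck = max(
            range(len(stage_costs)),
            key=lambda s: stage_costs[s] / boards[s],
        )
        boards[bottleneck] += 1
    return boards


def bottleneck_latency(stage_costs, boards):
    """Per-item latency of the slowest stage, assuming ideal linear scaling."""
    return max(cost / n for cost, n in zip(stage_costs, boards))


if __name__ == "__main__":
    costs = [4.0, 12.0, 2.0]  # assumed per-stage costs, arbitrary units
    plan = allocate_boards(costs, num_boards=6)
    print(plan, bottleneck_latency(costs, plan))
```

In a pipeline, steady-state throughput is bounded by the slowest stage, which is why the sketch always adds the next board to that stage. A real deployment would also need to account for Ethernet transfer time between boards, which this sketch ignores.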