The emerging trend of deploying complex algorithms, such as Deep Neural Networks (DNNs), imposes increasingly strict memory and energy-efficiency requirements on Internet-of-Things (IoT) end-nodes. Mixed-precision quantization has been proposed as a technique to minimize a DNN's memory footprint and maximize its execution efficiency, with negligible end-to-end accuracy degradation. In this work, we present a novel hardware and software stack for energy-efficient inference of mixed-precision Quantized Neural Networks (QNNs). We introduce Flex-V, a processor based on the RISC-V Instruction Set Architecture (ISA) that features fused Mac&Load mixed-precision dot product instructions; to avoid the exponential growth of the encoding space due to mixed-precision variants, we encode the operand formats into the Control and Status Registers (CSRs) rather than into the instructions themselves. The Flex-V core is integrated into a tightly-coupled cluster of eight processors; in addition, we provide a full framework for the end-to-end deployment of DNNs, including a compiler, optimized libraries, and a memory-aware deployment flow. Our results show up to 91.5 MAC/cycle and 3.26 TOPS/W on the cluster, implemented in a commercial 22 nm FDX technology, with a speed-up of up to 8.5x and an area overhead of only 5.6% with respect to the baseline. To demonstrate the capabilities of the architecture, we benchmark it with end-to-end real-life QNNs, improving performance by 2x to 2.5x with respect to existing solutions using fully flexible programmable processors.
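To illustrate the idea of keeping operand formats in a status register instead of in the opcode, the following is a minimal software sketch of a mixed-precision packed dot product. It is not the Flex-V instruction semantics: the 32-bit word size, the signed two's-complement fields, the `csr` dictionary with its `a_bits` and `w_bits` keys, and the helper names `pack`, `unpack`, and `mixed_dotp` are all assumptions made for illustration. The sketch also ignores the fused load and, for simplicity, uses only the low fields of the narrower operand.

```python
WORD_BITS = 32  # assumed register width for this sketch


def pack(values, bits):
    """Pack signed integers into one word, lowest field first."""
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * bits)
    return word


def unpack(word, bits, count):
    """Extract `count` signed `bits`-wide fields from a word, lowest field first."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    fields = []
    for i in range(count):
        raw = (word >> (i * bits)) & mask
        # Convert from two's complement to a signed Python int.
        fields.append(raw - (1 << bits) if raw & sign else raw)
    return fields


def mixed_dotp(acc, a_word, w_word, csr):
    """Accumulate a packed dot product; operand widths come from `csr`.

    The same call works for any width combination, so no separate
    "instruction" is needed per format pair. `csr` is a hypothetical
    stand-in for a format-holding status register.
    """
    a_bits, w_bits = csr["a_bits"], csr["w_bits"]
    # The wider operand limits how many lanes fit in one word.
    lanes = WORD_BITS // max(a_bits, w_bits)
    a = unpack(a_word, a_bits, lanes)
    w = unpack(w_word, w_bits, lanes)
    return acc + sum(x * y for x, y in zip(a, w))


if __name__ == "__main__":
    csr = {"a_bits": 8, "w_bits": 4}   # 8-bit activations, 4-bit weights
    a_word = pack([1, -2, 3, -4], 8)
    w_word = pack([7, -8, 1, -1], 4)
    print(mixed_dotp(0, a_word, w_word, csr))
```

Changing only the contents of `csr` (for example to `{"a_bits": 8, "w_bits": 8}`) switches the format without touching the call site, which is the property the abstract attributes to encoding formats in the CSRs.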