Edge-AI applications still face considerable challenges in improving computational efficiency under tight resource constraints. This work presents RAMAN, a resource-efficient, approximate posit(8,2)-based multiply-accumulate (MAC) architecture designed to improve hardware efficiency within bandwidth limitations. At the core of RAMAN is the REAP (Resource-Efficient Approximate Posit) MAC engine, which applies approximation in the posit multiplier to achieve significant area and power reductions at a modest cost in accuracy. To support diverse AI workloads, the MAC unit is incorporated into a scalable Vector Execution Unit (VEU) that enables hardware reuse and parallelism across deep neural network layers. Furthermore, we propose an algorithm-hardware co-design framework incorporating approximation-aware training to evaluate the impact of hardware-level approximation on application-level performance. Empirical validation shows that, relative to the baseline Posit Dot-Product Unit (PDPU) design, the proposed REAP MAC achieves LUT savings of up to 46% on FPGA, and area and power reductions of 35.66% and 31.28%, respectively, on ASIC, while maintaining high accuracy (98.45%) on handwritten digit recognition. RAMAN thus demonstrates a promising trade-off between hardware efficiency and learning performance, making it well suited to next-generation edge intelligence.