This work presents TREA, a low-precision, time-multiplexed, and resource-efficient edge-AI accelerator for object detection and classification, targeting the stringent area-power-latency constraints of edge vision platforms. The proposed architecture integrates a dual-precision (4/8-bit) SIMD multiply-accumulate (DQ-MAC) unit based on most-significant-digit-first (MSDF) shift-and-add computation with run-time bit truncation, which eliminates conventional multiplier overhead and reduces accumulator bit-width. The DQ-MAC supports either four FxP4 operations or one FxP8 operation per cycle, achieving up to 4x throughput improvement without hardware duplication. A structured hardware-aware reductive pruning (SHARP) strategy is co-designed with the SIMD datapath, enabling nearly 50% structured sparsity while maintaining full MAC utilization. As a result, a 3x3 convolution kernel is computed in 1 cycle in FxP4 mode compared to 9 cycles in FxP8, and a 5x5 kernel in 3 cycles versus 25 cycles, yielding up to 9x latency reduction at the kernel level. The accelerator further incorporates a reconfigurable CORDIC-based nonlinear activation function (RQ-NAF) core with a 9-stage pipeline, supporting Sigmoid, Tanh, and ReLU at one output per cycle after pipeline fill, while enabling (N-1) hardware reuse through time-multiplexing. The complete TREA architecture employs a 1D array of 100 SIMD DQ-MAC units with layer-wise hardware reuse, significantly reducing area and control complexity. Experimental results demonstrate substantial improvements in latency, hardware utilization, and energy efficiency compared to conventional fixed-precision and non-reconfigurable accelerators, validating TREA as an effective solution for real-time edge vision workloads.
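The shift-and-add principle and the kernel-level cycle counts quoted above can be illustrated with a minimal Python sketch. It is not the DQ-MAC design itself: `msdf_shift_add_mul` and `kernel_cycles` are hypothetical names, the multiplier handles unsigned operands only, truncation is modeled simply as stopping after a fixed number of leading bits, and the pruning model assumes that SHARP retains the floor of half the kernel weights (4 of 9 for 3x3, 12 of 25 for 5x5), which is an inference from the stated cycle counts rather than a detail given in the text.

```python
from typing import Optional


def msdf_shift_add_mul(a: int, b: int, width: int,
                       digits: Optional[int] = None) -> int:
    """Multiply unsigned a by unsigned b, scanning b from its MSB downward.

    Each step doubles the accumulator and conditionally adds a, so no
    array multiplier is needed. If `digits` is given, only that many
    leading bits of b are processed (an assumed model of truncation),
    and the partial result is rescaled by the number of skipped bits.
    """
    if not 0 <= b < (1 << width):
        raise ValueError("b must fit in `width` unsigned bits")
    steps = width if digits is None else min(digits, width)
    acc = 0
    for i in range(steps):
        bit = (b >> (width - 1 - i)) & 1   # most significant bit first
        acc = (acc << 1) + (a if bit else 0)
    return acc << (width - steps)          # account for skipped low bits


def kernel_cycles(kernel_size: int, lanes: int,
                  keep_num: int = 1, keep_den: int = 1) -> int:
    """Cycles to process one kernel_size x kernel_size kernel.

    Assumes floor(k*k * keep_num / keep_den) weights survive pruning and
    that `lanes` multiply-accumulates complete per cycle.
    """
    retained = (kernel_size * kernel_size * keep_num) // keep_den
    return -(-retained // lanes)           # ceiling division


if __name__ == "__main__":
    print(msdf_shift_add_mul(13, 11, width=4))            # exact product
    print(msdf_shift_add_mul(13, 11, width=4, digits=2))  # truncated product
    for k in (3, 5):
        dense_fxp8 = kernel_cycles(k, lanes=1)
        pruned_fxp4 = kernel_cycles(k, lanes=4, keep_num=1, keep_den=2)
        print(k, dense_fxp8, pruned_fxp4, dense_fxp8 / pruned_fxp4)
```

Under these assumptions the sketch reproduces the 9-to-1 and 25-to-3 cycle counts; the 5x5 case corresponds to a reduction of roughly 8.3x, so the 9x figure is the upper bound reached by the 3x3 kernel.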