This brief presents a runtime-adaptive, performance-enhanced vector engine built around a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The design can be reconfigured at runtime between approximate and accurate modes, exploiting the latency-accuracy trade-off across a wide range of workloads. Its resource-efficient approach yields up to 4x higher throughput within the same hardware resources by combining vectorised, time-multiplexed execution with flexible precision scaling. With a time-multiplexed multi-activation-function (AF) block and a lightweight pooling and normalisation unit, the vector engine supports flexible precision (4/8/16-bit) and high MAC density. ASIC implementation results show that each MAC stage saves up to 33% in time and 21% in power, and that a 256-PE configuration achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is discussed to assess the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.
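To make the approximate/accurate trade-off concrete, the following is a minimal sketch of an iterative linear-mode CORDIC multiply-accumulate in which the iteration count is a runtime parameter. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the CORDIC variant, the iteration counts used in each mode, or the fixed-point format, so the function names `cordic_mac` and `cordic_dot`, the use of floats, and the choices of 4 and 16 iterations are all hypothetical.

```python
# Sketch of a linear-mode CORDIC MAC (assumed variant; not taken from the brief).
# Hardware would use shifts and adds on fixed-point values; floats are used here
# only to keep the example short. Multiplying by 2**-i stands in for a right shift.

def cordic_mac(acc, x, z, iterations):
    """Return acc + x*z, approximated with `iterations` shift-add steps.

    Requires |z| <= 2. The residual error is at most |x| * 2**(1 - iterations).
    """
    y = acc
    for i in range(iterations):
        step = 2.0 ** -i                # shift amount for this iteration
        d = 1.0 if z >= 0 else -1.0     # rotation direction drives z toward 0
        y += d * x * step               # accumulate the partial product
        z -= d * step                   # consume the matching part of z
    return y


def cordic_dot(xs, ws, iterations):
    """Dot product built by chaining the MAC, one element at a time."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = cordic_mac(acc, x, w, iterations)
    return acc


if __name__ == "__main__":
    xs = [0.5, -0.25, 0.75]
    ws = [0.3, 0.9, -0.6]
    exact = sum(x * w for x, w in zip(xs, ws))
    for n in (4, 16):                   # hypothetical "approximate" and "accurate" modes
        approx = cordic_dot(xs, ws, n)
        print(f"iterations={n:2d}  result={approx:+.6f}  abs error={abs(approx - exact):.6f}")
```

Fewer iterations mean fewer cycles per MAC at the cost of a larger worst-case error, which is bounded by `|x| * 2**(1 - iterations)` per multiplication. That is the kind of latency-accuracy dial a runtime-configurable iterative unit exposes.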