This paper presents CARMEN, a runtime-adaptive, CORDIC-accelerated multi-precision vector engine for resource-efficient deep learning inference. The key insight is that CORDIC iteration depth directly governs computational accuracy, so the engine can switch dynamically between approximate and accurate execution modes without hardware modification. The architecture integrates a low-resource iterative CORDIC-based MAC unit with a time-multiplexed multi-activation function block, supporting flexible 8/16-bit precision and high hardware utilization. An ASIC implementation in 28 nm CMOS achieves up to a 33% reduction in computation cycles and 21% power savings per MAC stage; a 256-PE configuration delivers a compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W. FPGA deployment on PynqZ2 demonstrates real-time object detection with 154.6 ms latency at 0.43 W.
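To illustrate how iteration depth can trade accuracy for cycles, the following is a minimal floating-point sketch of a textbook linear-mode CORDIC multiplier wrapped in a MAC loop. It is not CARMEN's datapath: the function names `cordic_multiply` and `cordic_mac`, the input range |z| <= 2, and the iteration counts 4, 8 and 16 are assumptions made for illustration, and real hardware would use fixed-point shifts and adds rather than Python floats.

```python
def cordic_multiply(x: float, z: float, iterations: int) -> float:
    """Approximate x * z with linear-mode CORDIC (rotation mode).

    Hypothetical helper for illustration. Assumes |z| <= 2, the
    convergence range when the shift index starts at 0.
    """
    if abs(z) > 2.0:
        raise ValueError("z must lie in [-2, 2] for convergence")
    y = 0.0
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # direction that drives z toward 0
        step = 2.0 ** -i               # a right shift by i bits in hardware
        y += d * x * step              # accumulate the shifted multiplicand
        z -= d * step                  # consume the matching part of z
    return y


def cordic_mac(weights, activations, iterations: int) -> float:
    """Multiply-accumulate where every product uses the same CORDIC depth."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += cordic_multiply(a, w, iterations)
    return acc


if __name__ == "__main__":
    # Illustrative data only; weights are kept inside [-2, 2].
    weights = [0.75, -0.31, 0.12, 0.9]
    activations = [1.5, 2.0, -0.7, 0.25]
    exact = sum(w * a for w, a in zip(weights, activations))
    for n in (4, 8, 16):
        approx = cordic_mac(weights, activations, n)
        print(f"iterations={n:2d}  approx={approx:+.6f}  "
              f"abs_err={abs(approx - exact):.2e}")
```

In this sketch the residual left in `z` after `n` iterations is at most 2^-(n-1), so the product error is bounded by |x| * 2^-(n-1). Each extra iteration therefore buys roughly one more bit of accuracy at the cost of one more cycle, which is the kind of accuracy/latency knob the abstract describes.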