μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference because of their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and the VLIW processor prologue are often overlooked, which prevents accelerators from being optimized correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA lets DNN layers communicate directly with one another on the AIE array instead of going through shared memory tiles or the FPGA fabric, and it uses a 512-bit/cycle cascade connection in place of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and it performs design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation, on the AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reductions of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. μ-ORCA is open source at https://github.com/arc-research-lab/u-ORCA.
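
To give a feel for why link width matters at this scale, the following back-of-the-envelope sketch compares the cycles needed to move one layer's output over a 512-bit/cycle link and a 32-bit/cycle link, then adds a fixed per-layer overhead term. Only the two link widths come from the abstract. The activation count, the 8-bit data width, the compute and overhead cycle counts, and the function names are hypothetical values chosen for illustration. The model is idealized (it ignores link setup latency and any overlap between compute and transfer) and is not the performance model used by μ-ORCA.

```python
import math

# Link widths quoted in the abstract.
CASCADE_BITS_PER_CYCLE = 512
DMA_BITS_PER_CYCLE = 32


def transfer_cycles(num_values: int, bits_per_value: int, link_bits_per_cycle: int) -> int:
    """Idealized cycles to move a tensor over a link of the given width."""
    total_bits = num_values * bits_per_value
    return math.ceil(total_bits / link_bits_per_cycle)


def layer_latency_cycles(compute_cycles: int, transfer: int, fixed_overhead_cycles: int) -> int:
    """Toy per-layer latency: fixed overhead + compute + transfer, with no overlap.

    fixed_overhead_cycles stands in for costs such as synchronization and the
    VLIW prologue; its value here is an assumption, not a measured number.
    """
    return fixed_overhead_cycles + compute_cycles + transfer


if __name__ == "__main__":
    # Hypothetical layer output: 64 activations of 8 bits each.
    num_values, bits_per_value = 64, 8

    cascade = transfer_cycles(num_values, bits_per_value, CASCADE_BITS_PER_CYCLE)
    dma = transfer_cycles(num_values, bits_per_value, DMA_BITS_PER_CYCLE)
    print(f"cascade transfer: {cascade} cycle(s), DMA transfer: {dma} cycle(s)")

    # Hypothetical compute and overhead cycle counts, for illustration only.
    for name, xfer in (("cascade", cascade), ("DMA", dma)):
        total = layer_latency_cycles(compute_cycles=10, transfer=xfer, fixed_overhead_cycles=20)
        print(f"{name}: {total} cycles per layer")
```

Under these assumed numbers, the fixed overhead is larger than either the compute or the transfer term, which illustrates why a latency model for small layers needs to account for such overheads explicitly.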
