μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and the VLIW processor prologue are often overlooked, which makes it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct communication between DNN layers on the AIE array, instead of routing data through shared memory tiles or the FPGA fabric. It also uses a 512-bit/cycle cascade connection in place of a 32-bit/cycle DMA connection. In addition, μ-ORCA provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation, on the AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reductions of more than 1.70× and 1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. μ-ORCA is open source at https://github.com/arc-research-lab/u-ORCA.
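To give a sense of why the link width matters at this scale, the following back-of-envelope sketch compares the number of cycles needed to move one layer's activations over a 512-bit/cycle link and over a 32-bit/cycle link, the two widths quoted above. It is only an illustration and not μ-ORCA's performance model: the 256-byte payload and the function name `transfer_cycles` are hypothetical, and fixed costs such as synchronization and the VLIW prologue are deliberately left out.

```python
def transfer_cycles(num_bytes: int, bits_per_cycle: int) -> int:
    """Cycles needed to move num_bytes over a link of the given width.

    Idealized: assumes the link is fully utilized every cycle and
    ignores setup, synchronization, and prologue overheads.
    """
    total_bits = num_bytes * 8
    # Ceiling division: a partially filled final beat still costs a cycle.
    return (total_bits + bits_per_cycle - 1) // bits_per_cycle


# Hypothetical payload: 256 bytes of inter-layer activations.
payload_bytes = 256

cascade_cycles = transfer_cycles(payload_bytes, 512)  # 512-bit/cycle cascade
dma_cycles = transfer_cycles(payload_bytes, 32)       # 32-bit/cycle DMA

print(f"cascade: {cascade_cycles} cycles")
print(f"DMA:     {dma_cycles} cycles")
print(f"ratio:   {dma_cycles / cascade_cycles:.0f}x")
```

Under these idealized assumptions the width ratio of 512/32 = 16 carries over directly to the transfer time. In practice the fixed overheads mentioned in the abstract would narrow the end-to-end gap, which is the motivation for an overhead-aware model.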
