AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

Deep reinforcement learning (DRL) has demonstrated remarkable success across various domains. However, the tight coupling between training and inference makes accelerating DRL training a central challenge for DRL optimization. Two key issues hinder efficient DRL training: (1) computational intensity varies significantly across DRL algorithms, and even among operations within the same algorithm, which complicates hardware platform selection; and (2) DRL's wide dynamic range could lead to substantial reward errors under conventional FP16+FP32 mixed-precision quantization. Existing work has primarily focused on accelerating DRL on specific computing units or on optimizing inference-stage quantization. We propose AP-DRL to address both challenges. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through hardware-aware optimization. Our approach begins with a bottleneck analysis of CPU, FPGA, and AI Engine (AIE) performance across diverse DRL workloads, which informs the design principles for AP-DRL's inter-component task partitioning and quantization optimization. The framework then addresses platform selection through design space exploration (DSE)-based profiling and integer linear programming (ILP)-based partitioning models that match each operation to the most suitable computing unit based on its computational characteristics. For the quantization challenge, AP-DRL employs a hardware-aware algorithm that coordinates FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations, leveraging Versal ACAP's native support for these precision formats. Comprehensive experiments indicate that AP-DRL achieves speedups of up to 4.17$\times$ over programmable logic baselines and up to 3.82$\times$ over AI Engine baselines while maintaining training convergence.
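
The dynamic-range concern behind the FP16/BF16 choice can be illustrated with a minimal sketch. This is not AP-DRL's quantization algorithm. It only compares how the two 16-bit formats treat a large and a small value, using Python's standard `struct` module. The helper names `to_fp16` and `to_bf16` and the sample values are assumptions chosen for illustration, and the BF16 conversion here truncates rather than rounds to nearest.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (5 exponent bits, 10 fraction bits).

    Values beyond the FP16 range are mapped to infinity, mimicking overflow.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate bfloat16 (8 exponent bits, 7 fraction bits).

    Assumption: keep the top 16 bits of the float32 encoding, i.e. truncate
    instead of rounding to nearest. Real hardware may round differently.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    large = 70000.0  # hypothetical large value, above the FP16 maximum of 65504
    small = 0.1      # hypothetical small value
    for name, value in (("large", large), ("small", small)):
        print(name, value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

In this sketch the large value overflows in FP16 but stays finite in BF16, while the small value is represented more accurately in FP16. That trade-off between range and precision is one reason a scheme might assign different formats to different operations.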
