Multi-Scale Deformable Attention (MSDAttn) has become a fundamental component in a variety of vision tasks thanks to its effective multi-scale grid sampling (MSGS). However, its reliance on data-dependent sampling results in highly irregular memory access patterns, making it a memory-intensive operation that runs inefficiently on GPUs. Near-memory processing (NMP) offers a promising approach to accelerating memory-bound kernels, yet existing NMP-based attention accelerators remain suboptimal for MSDAttn because their load-balancing and data-reuse strategies are incompatible with it. Specifically, current NMP solutions distribute processing elements (PEs) uniformly across all banks, leading to significant PE underutilization and excessive cross-bank data transfers. Moreover, most rely on locality-based reuse, which fails under MSDAttn's unpredictable sampling patterns. To address these challenges, this paper presents DANMP, a hardware-software co-designed NMP-based MSDAttn accelerator. On the hardware side, DANMP adopts non-uniform NMP integration to handle unbalanced workloads: PEs are allocated only in selected banks holding hot entries, while cold data are processed at the bank-group level, reducing PE idleness and cross-bank transfers. On the software side, it introduces a clustering-and-packing (CAP) method that leverages clustering to improve temporal locality in query processing, enhancing data reuse. Finally, we implement host-NMP co-optimization techniques, including an optimized programming model, customized instructions, and a tailored dataflow. Experiments on object-detection inference show that DANMP achieves a 97.43× speedup and a 208.47× energy-efficiency improvement over an NVIDIA A6000 GPU.
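To make the memory-access problem concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the multi-scale grid-sampling step for a single query head: each query bilinearly samples K points from each of L feature levels at predicted, data-dependent locations, then aggregates them with attention weights. The function names, shapes, and normalization convention are illustrative assumptions; the point is that the sampled addresses depend on runtime values, so the gathers are irregular.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat (H, W, C) at a fractional location (x, y)."""
    H, W, _ = feat.shape
    x0f, y0f = np.floor(x), np.floor(y)
    wx, wy = x - x0f, y - y0f                      # interpolation weights
    x0 = int(np.clip(x0f, 0, W - 1)); x1 = int(np.clip(x0f + 1, 0, W - 1))
    y0 = int(np.clip(y0f, 0, H - 1)); y1 = int(np.clip(y0f + 1, 0, H - 1))
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

def msda_sample(feats, ref_xy, offsets, weights):
    """
    One query head of multi-scale deformable attention sampling (illustrative).
    feats:   list of L feature maps, each (H_l, W_l, C)
    ref_xy:  (2,) reference point, normalized to [0, 1]
    offsets: (L, K, 2) predicted sampling offsets, normalized
    weights: (L, K) attention weights summing to 1
    Returns the aggregated (C,) output vector.
    """
    out = np.zeros(feats[0].shape[-1])
    for l, feat in enumerate(feats):
        H, W, _ = feat.shape
        for k in range(offsets.shape[1]):
            # Sampling address depends on predicted offsets -> irregular gathers
            x = (ref_xy[0] + offsets[l, k, 0]) * (W - 1)
            y = (ref_xy[1] + offsets[l, k, 1]) * (H - 1)
            out += weights[l, k] * bilinear_sample(feat, x, y)
    return out
```

Because `x` and `y` are computed from learned offsets at runtime, consecutive queries touch scattered DRAM rows and banks; this is the access pattern that uniform PE placement and locality-based reuse handle poorly, and that DANMP's non-uniform NMP integration and CAP clustering target.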