Analog In-Memory Computing (AIMC) accelerators execute matrix-vector multiplications directly within memory arrays, reducing data movement and improving DNN inference efficiency. Their limited effective precision motivates heterogeneous architectures that combine analog compute tiles with digital processing units. This letter classifies existing methods for partitioning DNN workloads across these resources by mapping granularity, optimization strategy, and model support, and distills them into a unified four-stage workflow. To demonstrate the workflow on a model class not yet addressed by existing methods, we apply its first two stages to GPT-2, producing the first AIMC-specific precision sensitivity profile for a decoder-only transformer. Sensitivity is concentrated in 4 of 49 projections, with the first decoder block's attention output leading by an order of magnitude. This suggests that projection-level mapping and selective digital execution of early-block and output-facing projections are important for reliable deployment of decoder-only transformers on AIMC hardware.
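As a rough illustration of what a projection-level sensitivity profile involves, the sketch below perturbs one projection at a time with Gaussian weight noise and records the mean squared deviation of the final output. It is a minimal sketch under stated assumptions, not the letter's method: the toy model is a small chain of random linear layers rather than GPT-2, the additive-noise model and `rel_sigma` value are placeholders for real AIMC non-idealities, and the projection names and the functions `sensitivity_profile` and `select_digital` are hypothetical.

```python
import math
import random


def matvec(weight, x):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def forward(layers, x):
    """Toy stand-in for a model: a chain of linear projections with tanh."""
    for _, weight in layers:
        x = [math.tanh(v) for v in matvec(weight, x)]
    return x


def perturb(weight, rel_sigma, rng):
    """Add Gaussian noise scaled to the largest weight magnitude.

    Assumption: this simple additive model stands in for analog
    non-idealities; real AIMC noise is device-specific.
    """
    scale = rel_sigma * max(abs(w) for row in weight for w in row)
    return [[w + rng.gauss(0.0, scale) for w in row] for row in weight]


def sensitivity_profile(layers, inputs, rel_sigma=0.05, trials=20, seed=0):
    """Mean squared output deviation when only one projection is noisy."""
    rng = random.Random(seed)
    clean = [forward(layers, x) for x in inputs]
    profile = {}
    for i, (name, weight) in enumerate(layers):
        total = 0.0
        for _ in range(trials):
            noisy_layer = (name, perturb(weight, rel_sigma, rng))
            noisy = layers[:i] + [noisy_layer] + layers[i + 1:]
            for x, ref in zip(inputs, clean):
                out = forward(noisy, x)
                total += sum((a - b) ** 2 for a, b in zip(out, ref)) / len(ref)
        profile[name] = total / (trials * len(inputs))
    return profile


def select_digital(profile, k):
    """Pick the k most sensitive projections for digital execution."""
    return sorted(profile, key=profile.get, reverse=True)[:k]


# Hypothetical projection names and random weights, for illustration only.
rng = random.Random(1)
dim = 8
names = ["block0.attn_out", "block0.mlp_in", "block1.attn_out", "lm_head"]
layers = [
    (n, [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)])
    for n in names
]
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(4)]

profile = sensitivity_profile(layers, inputs)
digital = select_digital(profile, k=2)
for name in names:
    target = "digital" if name in digital else "analog"
    print(f"{name}: {profile[name]:.3e} -> {target}")
```

Because the weights here are random, the resulting ranking carries no meaning for GPT-2. The sketch only shows the shape of the procedure: profile each projection in isolation, rank the results, then assign the most sensitive projections to digital units.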