AI model architectures are diversifying rapidly. Although dense matrix multiplication underlies today's CNNs and transformers, emerging architectures (state-space models, long convolutions via the fast Fourier transform (FFT), Kolmogorov-Arnold networks, and spiking networks) are not dominated by multiply-accumulate (MAC) operations; they spend much of their computation on vector and non-MAC primitives that homogeneous, MAC-centric neural processing units (NPUs) serve poorly. This has motivated heterogeneous NPUs (HPUs) built from non-identical tiles. Prior heterogeneous designs vary only one or two coarse knobs (typically MAC precision or array size) and are evaluated on narrow workloads; no existing framework supports fine-grained HPU design, where tiles differ across many architectural dimensions at once. We present MOSAIC, an analytical simulator and design-space-exploration (DSE) framework for HPU microarchitecture design. MOSAIC searches the joint space of tile-level heterogeneity: beyond array size and precision, it varies tile-type composition (large Big tiles, small Little tiles, and non-MAC Special-Function tiles), dataflow, sparsity mode, MAC engine type, and the special-function units used for non-MAC operators (FFT, spiking-integrate, polynomial). Unlike prior simulators that model a single homogeneous tile type, MOSAIC gives non-MAC tiles their own energy, area, and timing models and maps operators across a mix of tiles with a heterogeneity-aware compiler. A multi-seed pipeline that pairs a stratified sweep with genetic-algorithm refinement returns Pareto-optimal designs, with cost models calibrated to a 7 nm node and cross-validated against NVIDIA's Deep Learning Accelerator (NVDLA). Across a 20-workload suite, the best general-purpose HPU found by MOSAIC (approximately 200 mm^2, combining Big, Little, and Special-Function tiles) achieves 46.91% mean energy savings over the best homogeneous baseline of equal area.
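As a rough illustration of the two quantities the abstract reports, Pareto-optimal designs and iso-area energy savings, the following minimal sketch filters candidate designs over area and energy and computes a percentage saving against an equal-area baseline. It is not MOSAIC's implementation: the `Design` record, its fields, the function names, and all numeric values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Design:
    """Hypothetical candidate design; these fields are not MOSAIC's schema."""
    name: str
    area_mm2: float
    energy_mj: float


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.area_mm2 <= b.area_mm2 and a.energy_mj <= b.energy_mj
    better = a.area_mm2 < b.area_mm2 or a.energy_mj < b.energy_mj
    return no_worse and better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep designs that no other design dominates (minimizing area and energy)."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


def energy_savings_pct(baseline_mj: float, candidate_mj: float) -> float:
    """Percent energy saved relative to a baseline of equal area."""
    return 100.0 * (baseline_mj - candidate_mj) / baseline_mj


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not results from the paper.
    candidates = [
        Design("homogeneous", 200.0, 10.0),
        Design("big_little", 200.0, 7.0),
        Design("big_little_sfu", 200.0, 5.0),
        Design("small", 100.0, 12.0),
    ]
    print([d.name for d in pareto_front(candidates)])
    print(energy_savings_pct(10.0, 5.0))
```

A real exploration would have more objectives (for example latency) and would generate candidates by sweep and genetic search rather than from a fixed list; the dominance test itself extends directly to additional objectives.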