We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across process nodes from 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads: Llama 3.1 8B FP16 (high-performance mode, 29,809 tokens per second at 3nm) and SmolVLM (low-power mode, under 13 mW at every node, at 10 MHz). Across seven process nodes, the RL agent automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation, without node-specific manual retuning.
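To make the mixed discrete-continuous action space and the unified PPA objective concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the option lists (`MESH_SIDES`, `VLENS`, `FETCH_WIDTHS`), the `sram_fraction` parameter, the uniform-binning decoder that maps a continuous policy output in [-1, 1] to a discrete choice, and the weighted log-ratio form of the reward. The abstract does not specify any of these details.

```python
import math
from dataclasses import dataclass
from typing import Sequence

# Hypothetical discrete option sets; the paper's actual choices are not given.
MESH_SIDES = (1, 2, 4, 8)
VLENS = (128, 256, 512, 1024)
FETCH_WIDTHS = (1, 2, 4)


@dataclass(frozen=True)
class DesignAction:
    mesh_rows: int        # discrete
    mesh_cols: int        # discrete
    vlen: int             # discrete, per-tile vector length
    fetch_width: int      # discrete, per-tile fetch width
    sram_fraction: float  # continuous in [0, 1] (illustrative parameter)


@dataclass(frozen=True)
class PPA:
    power_mw: float
    tokens_per_s: float
    area_mm2: float


def _clamp(u: float) -> float:
    """Clip a raw policy output to the range [-1, 1]."""
    return min(max(u, -1.0), 1.0)


def _pick(options: Sequence[int], u: float) -> int:
    """Map u in [-1, 1] to one entry of options by uniform binning."""
    n = len(options)
    idx = min(int((_clamp(u) + 1.0) / 2.0 * n), n - 1)
    return options[idx]


def decode_action(raw: Sequence[float]) -> DesignAction:
    """Turn five squashed policy outputs into one mixed action."""
    return DesignAction(
        mesh_rows=_pick(MESH_SIDES, raw[0]),
        mesh_cols=_pick(MESH_SIDES, raw[1]),
        vlen=_pick(VLENS, raw[2]),
        fetch_width=_pick(FETCH_WIDTHS, raw[3]),
        sram_fraction=(_clamp(raw[4]) + 1.0) / 2.0,
    )


def ppa_reward(ppa: PPA, w_perf: float = 1.0,
               w_power: float = 1.0, w_area: float = 1.0) -> float:
    """Assumed scalarization: reward throughput, penalize power and area.

    Log terms keep the three quantities on comparable scales.
    """
    return (w_perf * math.log(ppa.tokens_per_s)
            - w_power * math.log(ppa.power_mw)
            - w_area * math.log(ppa.area_mm2))


if __name__ == "__main__":
    action = decode_action([-1.0, 1.0, 0.0, 0.0, 0.5])
    print(action)
    # Made-up PPA numbers, only to exercise the function.
    print(ppa_reward(PPA(power_mw=10.0, tokens_per_s=1000.0, area_mm2=1.0)))
```

In this sketch, changing the weights `w_perf`, `w_power`, and `w_area` is one plausible way to express a high-performance versus a low-power mode with a single objective; whether the paper does so this way is not stated in the abstract.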