Large Foundation Model (LFM) inference is both memory- and compute-intensive and has traditionally relied on GPUs. However, the limited availability and high cost of GPUs have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer higher bandwidth and better locality, but they suffer from severe thermal challenges and non-uniform cache latencies caused by the 3D Network-on-Chip (NoC). Optimally managing thread migration and voltage/frequency (V/f) scaling is non-trivial because of LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.