The AIPC concept is gaining popularity, and more and more hybrid CPUs will be running AI models on client devices. However, current AI inference frameworks overlook the imbalanced hardware capabilities of hybrid CPUs, leading to low inference performance. To address this issue, we introduce a dynamic parallel method for hybrid CPUs that significantly improves LLM inference performance by balancing the workload across the cores of a hybrid CPU before the parallel work starts. With this method, Neural Speed achieves more than 90% of memory bandwidth (on average) on two hybrid Intel CPUs.
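The core idea of balancing work across unequal cores before a parallel region starts can be sketched as capability-proportional partitioning. The sketch below is illustrative only and assumes hypothetical per-core throughput estimates (e.g. P-cores roughly twice as fast as E-cores); the function name and ratios are not Neural Speed's actual API.

```python
# Hypothetical sketch: split a workload across hybrid-CPU cores in
# proportion to each core's estimated throughput, so faster P-cores get
# larger shares and all cores finish at roughly the same time.

def partition_rows(total_rows, core_throughputs):
    """Return a per-core row count proportional to core_throughputs."""
    total = sum(core_throughputs)
    shares = [int(total_rows * t / total) for t in core_throughputs]
    # Hand out rounding leftovers to the fastest cores first.
    leftover = total_rows - sum(shares)
    for i in sorted(range(len(shares)), key=lambda i: -core_throughputs[i]):
        if leftover == 0:
            break
        shares[i] += 1
        leftover -= 1
    return shares

# Example: 4 P-cores (relative throughput 2.0) and 4 E-cores (1.0).
shares = partition_rows(1000, [2.0] * 4 + [1.0] * 4)
print(shares)  # → [167, 167, 167, 167, 83, 83, 83, 83]
```

A static equal split would instead leave the P-cores idle while the E-cores finish their halves, which is the imbalance the dynamic method is designed to avoid.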