Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the compute potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes, each grouping a subset of cores with local memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, which limits inference scalability and the deployment of intelligent capabilities on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight remains compatible with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.